[GitHub] [hudi] hudi-bot removed a comment on pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


hudi-bot removed a comment on pull request #4503:
URL: https://github.com/apache/hudi/pull/4503#issuecomment-1004558109


   
   ## CI report:
   
   * 6c79ce6e2ff6eec244d11d35f3556fd5356803cc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4878)
   
   Bot commands
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


hudi-bot commented on pull request #4503:
URL: https://github.com/apache/hudi/pull/4503#issuecomment-1004580607


   
   ## CI report:
   
   * 6c79ce6e2ff6eec244d11d35f3556fd5356803cc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4878)
   
   Bot commands
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] melin opened a new issue #4504: [SUPPORT] Support analyze sql

2022-01-03 Thread GitBox


melin opened a new issue #4504:
URL: https://github.com/apache/hudi/issues/4504


   Support analyze sql: 
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html
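
   For reference, the syntax from the linked Spark docs looks like the following when issued through a Spark session. A minimal sketch, assuming a Hudi table registered under the placeholder name `hudi_table` (the columns `id` and `ts` are likewise placeholders):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class AnalyzeHudiTable {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("analyze-hudi")
           .getOrCreate();
       // Table-level statistics (row count, size in bytes)
       spark.sql("ANALYZE TABLE hudi_table COMPUTE STATISTICS");
       // Column-level statistics for the query optimizer
       spark.sql("ANALYZE TABLE hudi_table COMPUTE STATISTICS FOR COLUMNS id, ts");
       spark.stop();
     }
   }
   ```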






[GitHub] [hudi] codope commented on issue #4154: [SUPPORT] INSERT OVERWRITE operation does not work when using Spark SQL

2022-01-03 Thread GitBox


codope commented on issue #4154:
URL: https://github.com/apache/hudi/issues/4154#issuecomment-1004559852


   @BenjMaq How did you sync to Hive? My hunch is that something went wrong during Hive sync. Using the latest master and the script that @YannByron shared, I ran Hive sync and could see the latest data from both Hive and Presto.
   Gist: https://gist.github.com/codope/d42a6fb85ef081b3bcd94d4897a4481d
   Hoodie Timeline:
   ```
   root@adhoc-1:/opt# hadoop fs -ls /tmp/hudi/test_overwrite1/.hoodie
   Found 11 items
   drwxr-xr-x   - root supergroup  0 2022-01-04 06:20 
/tmp/hudi/test_overwrite1/.hoodie/.aux
   drwxr-xr-x   - root supergroup  0 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/.temp
   -rw-r--r--   1 root supergroup   1791 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/20220104062057294.commit
   -rw-r--r--   1 root supergroup  0 2022-01-04 06:20 
/tmp/hudi/test_overwrite1/.hoodie/20220104062057294.commit.requested
   -rw-r--r--   1 root supergroup   1194 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/20220104062057294.inflight
   -rw-r--r--   1 root supergroup   1907 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/20220104062117644.replacecommit
   -rw-r--r--   1 root supergroup   1204 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/20220104062117644.replacecommit.inflight
   -rw-r--r--   1 root supergroup  0 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/20220104062117644.replacecommit.requested
   drwxr-xr-x   - root supergroup  0 2022-01-04 06:20 
/tmp/hudi/test_overwrite1/.hoodie/archived
   -rw-r--r--   1 root supergroup    985 2022-01-04 06:20 
/tmp/hudi/test_overwrite1/.hoodie/hoodie.properties
   drwxr-xr-x   - root supergroup  0 2022-01-04 06:21 
/tmp/hudi/test_overwrite1/.hoodie/metadata
   ```






[GitHub] [hudi] hudi-bot commented on pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


hudi-bot commented on pull request #4503:
URL: https://github.com/apache/hudi/pull/4503#issuecomment-1004558109


   
   ## CI report:
   
   * 6c79ce6e2ff6eec244d11d35f3556fd5356803cc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4878)
   
   Bot commands
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


hudi-bot removed a comment on pull request #4503:
URL: https://github.com/apache/hudi/pull/4503#issuecomment-1004557125


   
   ## CI report:
   
   * 6c79ce6e2ff6eec244d11d35f3556fd5356803cc UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] danny0405 commented on a change in pull request #4141: [HUDI-2815] Support partial update for streaming change logs

2022-01-03 Thread GitBox


danny0405 commented on a change in pull request #4141:
URL: https://github.com/apache/hudi/pull/4141#discussion_r777805857



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateWithLatestAvroPayload.java
##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Objects;
+import java.util.Properties;
+
+import static org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro;
+
+/**
+ * The only difference with {@link DefaultHoodieRecordPayload} is that support 
update partial fields
+ * in latest record which value is not null to existing record instead of all 
fields.
+ *
+ *  Assuming a {@link GenericRecord} has three fields: a int , b int, c 
int. The first record value: 1, 2, 3.
+ * The second record value is: 4, 5, null, the field c value is null. After 
call the combineAndGetUpdateValue method,
+ * we will get final record value: 4, 5, 3, field c value will not be 
overwritten because its value is null in latest record.
+ */

Review comment:
   ```java
   /**
* The only difference with {@link 
OverwriteNonDefaultsWithLatestAvroPayload} is that it supports
* merging the latest non-null partial fields with the old record instead of 
replacing the whole record.
*
*  Assuming a {@link GenericRecord} has row schema: (f0 int , f1 int, f2 
int).
* The first record value is: (1, 2, 3), the second record value is: (4, 5, 
null) with the field c value as null.
* Calling the #combineAndGetUpdateValue method of the two records returns 
record: (4, 5, 3).
* Note that field c value is ignored because it is null.
*/
   ```

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateWithLatestAvroPayload.java
##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Objects;
+import java.util.Properties;
+
+import static org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro;
+
+/**
+ * The only difference with {@link DefaultHoodieRecordPayload} is that support 
update partial fields
+ * in latest record which value is not null to existing record instead of all 
fields.
+ *
+ *  Assuming a {@link GenericRecord} has three fields: a int , b int, c 
int. The first record value: 1, 2, 3.
+ * The second record value is: 4, 5, null, the field c value is null. After 
call the combineAndGetUpdateValue method,
+ * we will get final record value: 4, 5, 3, field c value will not be 
overwritten because its value is null in latest record.
+ */
+public class PartialUpdateWithLatestAvroPayload extends 
DefaultHoodieRecordPayload {
+

Review comment:
   We can extend from `OverwriteNonDefaultsWithLatestAvroPayload` instead of `DefaultHoodieRecordPayload`, and we should override `#preCombine` instead of `combineAndGetUpdateValue`.
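
   For illustration, the merge semantics the suggested javadoc describes boil down to a field-wise copy. A minimal sketch (a hypothetical helper, not the PR's actual code), assuming both records share the same schema:
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   
   public class PartialMergeSketch {
     // Copy each non-null field of the newer record over the older one,
     // e.g. (1, 2, 3) merged with (4, 5, null) yields (4, 5, 3).
     static GenericRecord mergeNonNullFields(GenericRecord older, GenericRecord newer, Schema schema) {
       GenericRecord merged = new GenericData.Record(schema);
       for (Schema.Field field : schema.getFields()) {
         Object newerValue = newer.get(field.name());
         // keep the old value only when the newer record left the field null
         merged.put(field.name(), newerValue != null ? newerValue : older.get(field.name()));
       }
       return merged;
     }
   }
   ```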





[GitHub] [hudi] hudi-bot commented on pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


hudi-bot commented on pull request #4503:
URL: https://github.com/apache/hudi/pull/4503#issuecomment-1004557125


   
   ## CI report:
   
   * 6c79ce6e2ff6eec244d11d35f3556fd5356803cc UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2022-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2438:
-
Labels: BigQuery Integration pull-request-available  (was: BigQuery 
Integration)

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache 
> Hudi file format, but it does support the Parquet file format. The 
> proposal is to implement a BigQuerySync, similar to HiveSync, to sync a Hudi 
> table as a BigQuery external Parquet table so that users can query Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to a 
> BigQuery data mart; this will help them write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]





[GitHub] [hudi] vingov opened a new pull request #4503: [HUDI-2438] [RFC-34] Added the implementation details for the BigQuery integration

2022-01-03 Thread GitBox


vingov opened a new pull request #4503:
URL: https://github.com/apache/hudi/pull/4503


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *This pull request adds the implementation details for the Hudi BigQuery 
integration RFC-34.*
   
   ## Brief change log
   
   *(for example:)*
 - *RFC-34 Hudi BigQuery Integration details were updated.*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [x] CI is green
   
- [x] Necessary doc changes done or have another open PR
  
- [x] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] codope commented on issue #4474: [SUPPORT] Should we shade all aws dependencies to avoid class conflicts?

2022-01-03 Thread GitBox


codope commented on issue #4474:
URL: https://github.com/apache/hudi/issues/4474#issuecomment-1004546342


   @boneanxs Shading is fine. Do consider adding a new profile so that users 
can build according to their use case.
   cc @umehrot2  for more inputs.






[GitHub] [hudi] a0x commented on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

2022-01-03 Thread GitBox


a0x commented on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004539542


   @kazdy I did recompile the Hudi packages with the mentioned config, yet the error remains.
   
   This is an interesting problem, because everything works fine in `spark-shell`, yet the problem occurs **only in PySpark**.
   
   So I think the library conflict is hidden in the difference between `spark-shell` and `pyspark`.






[jira] [Updated] (HUDI-3153) Make Trino connector implementation extensible for different table/query types, data formats, etc.

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3153:

Sprint: Hudi-Sprint-Jan-3

> Make Trino connector implementation extensible for different table/query 
> types, data formats, etc.
> --
>
> Key: HUDI-3153
> URL: https://issues.apache.org/jira/browse/HUDI-3153
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2695) Documentation

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2695:

Sprint: Hudi-Sprint-Jan-3

> Documentation
> -
>
> Key: HUDI-2695
> URL: https://issues.apache.org/jira/browse/HUDI-2695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3097) Address dependency issue with hudi-trino-bundle in connector

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3097:

Sprint: Hudi-Sprint-Jan-3

> Address dependency issue with hudi-trino-bundle in connector
> 
>
> Key: HUDI-3097
> URL: https://issues.apache.org/jira/browse/HUDI-3097
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2724) Benchmark connector

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2724:

Sprint: Hudi-Sprint-Jan-3

> Benchmark connector
> ---
>
> Key: HUDI-2724
> URL: https://issues.apache.org/jira/browse/HUDI-2724
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> For all the different scenarios, performance should be on par with or better 
> than plain Parquet tables in Hive.
> 1. Non-partitioned big data set.
> 2. Large number of partitions.





[hudi] branch master updated (29ab6fb -> 7329d22)

2022-01-03 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 29ab6fb  [HUDI-3140] Fix bulk_insert failure on Spark 3.2.0 (#4498)
 add 7329d22  Adding tests to validate different key generators (#4473)

No new revisions were added by this update.

Summary of changes:
 .../hudi/functional/TestCOWDataSourceStorage.scala | 85 --
 1 file changed, 61 insertions(+), 24 deletions(-)


[GitHub] [hudi] codope merged pull request #4473: [HUDI-2590] Adding tests to validate different key generators

2022-01-03 Thread GitBox


codope merged pull request #4473:
URL: https://github.com/apache/hudi/pull/4473


   






[jira] [Updated] (HUDI-3160) Column Stats index should use the same column for the index key in the write and read code path

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3160:
-
Sprint: Hudi-Sprint-Jan-3

> Column Stats index should use the same column for the index key in the write 
> and read code path
> ---
>
> Key: HUDI-3160
> URL: https://issues.apache.org/jira/browse/HUDI-3160
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3160) Column Stats index should use the same column for the index key in the write and read code path

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3160:
-
Fix Version/s: 0.11.0

> Column Stats index should use the same column for the index key in the write 
> and read code path
> ---
>
> Key: HUDI-3160
> URL: https://issues.apache.org/jira/browse/HUDI-3160
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2735) Fix archival of commits in Java client for Kafka Connect

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2735:

Fix Version/s: 0.10.1

> Fix archival of commits in Java client for Kafka Connect
> 
>
> Key: HUDI-2735
> URL: https://issues.apache.org/jira/browse/HUDI-2735
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0, 0.10.1
>
>






[jira] [Updated] (HUDI-3007) Address minor feedbacks on the repair utility

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3007:

Fix Version/s: 0.10.1

> Address minor feedbacks on the repair utility
> -
>
> Key: HUDI-3007
> URL: https://issues.apache.org/jira/browse/HUDI-3007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0, 0.10.1
>
>






[GitHub] [hudi] codope commented on a change in pull request #4473: [HUDI-2590] Adding tests to validate different key generators

2022-01-03 Thread GitBox


codope commented on a change in pull request #4473:
URL: https://github.com/apache/hudi/pull/4473#discussion_r777829954



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala
##
@@ -100,8 +120,26 @@ class TestCOWDataSourceStorage extends 
SparkClientFunctionalTestHarness {
 assertEquals(updatedVerificationVal, snapshotDF2.filter(col("_row_key") 
=== verificationRowKey).select(verificationCol).first.getString(0))
 
 // Upsert Operation without Hudi metadata columns
-val records2 = recordsToStrings(dataGen.generateUpdates("001", 100)).toList
-val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2 , 
2))
+val records2 = recordsToStrings(dataGen.generateUpdates("002", 100)).toList
+var inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+
+if (classOf[TimestampBasedKeyGenerator].getName.equals(keyGenClass)) {
+  // incase of Timestamp based key gen, current_ts should not be updated. 
but dataGen.generateUpdates() would have updated

Review comment:
   Sounds good. Will land this.








[jira] [Updated] (HUDI-3160) Column Stats index should use the same column for the index key in the write and read code path

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3160:
-
Issue Type: Task  (was: Bug)

> Column Stats index should use the same column for the index key in the write 
> and read code path
> ---
>
> Key: HUDI-3160
> URL: https://issues.apache.org/jira/browse/HUDI-3160
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>






[jira] [Assigned] (HUDI-3144) Parallelize metadata table getRecordsByKeys() operations

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy reassigned HUDI-3144:


Assignee: Manoj Govindassamy

> Parallelize metadata table getRecordsByKeys() operations
> 
>
> Key: HUDI-3144
> URL: https://issues.apache.org/jira/browse/HUDI-3144
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Critical
> Fix For: 0.11.0
>
>
> When a metadata index lookup is done for several thousands of keys, the 
> keys are looked up in the metadata table partitions in a serial fashion, key by 
> key, leading to an overall delay in the index lookup.
>  # When the indexes are laid out in multiple file groups, the lookup can be 
> parallelized at the filegroup level (see the sketch below)
>  # Even within a single filegroup, sorted keys can be split and the lookup 
> can be done in parallel. 
>  
>  
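A minimal sketch of the filegroup-level parallelism described in point 1 (the FileGroup/Record types and both helper methods are hypothetical, not Hudi's actual API):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

interface Record {}

interface FileGroup {
  // keys (from the sorted input) that this file group can contain
  List<String> filterOwnedKeys(List<String> sortedKeys);
  // point lookups within this file group
  Map<String, Record> lookupKeys(List<String> keys);
}

class ParallelKeyLookupSketch {
  // Fan the key lookups out across file groups instead of probing serially, key by key.
  static Map<String, Record> getRecordsByKeys(List<FileGroup> fileGroups,
                                              List<String> sortedKeys,
                                              ExecutorService pool)
      throws InterruptedException, ExecutionException {
    List<Future<Map<String, Record>>> futures = new ArrayList<>();
    for (FileGroup fg : fileGroups) {
      List<String> keysForGroup = fg.filterOwnedKeys(sortedKeys);
      futures.add(pool.submit(() -> fg.lookupKeys(keysForGroup)));
    }
    Map<String, Record> merged = new ConcurrentHashMap<>();
    for (Future<Map<String, Record>> f : futures) {
      merged.putAll(f.get()); // blocks until each group's lookup completes
    }
    return merged;
  }
}
{code}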





[jira] [Assigned] (HUDI-3143) Support multiple file groups for metadata table index partitions

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy reassigned HUDI-3143:


Assignee: Manoj Govindassamy

> Support multiple file groups for metadata table index partitions
> 
>
> Key: HUDI-3143
> URL: https://issues.apache.org/jira/browse/HUDI-3143
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Critical
> Fix For: 0.11.0
>
>
> Metadata table index partitions today have only one file group. This is ok 
> for the files partition, as the number of records doesn't increase linearly with 
> the data table records. But the newer partitions like bloom_filters and 
> col_stats would have records that grow linearly with the data table 
> files and records. We need the metadata table to support multiple file groups.
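
As a rough illustration of how keys could be spread across a fixed set of file groups (a hypothetical mapping, not necessarily what Hudi will adopt): hash the record key and take it modulo the file-group count, so a given key always lands in the same group.

{code:java}
public class FileGroupMappingSketch {
  // Stable key-to-file-group assignment: same key, same group, every time.
  static int fileGroupFor(String recordKey, int numFileGroups) {
    int h = recordKey.hashCode() & 0x7fffffff; // force a non-negative hash
    return h % numFileGroups;
  }
}
{code}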





[jira] [Created] (HUDI-3160) Column Stats index should use the same column for the index key in the write and read code path

2022-01-03 Thread Manoj Govindassamy (Jira)
Manoj Govindassamy created HUDI-3160:


 Summary: Column Stats index should use the same column for the 
index key in the write and read code path
 Key: HUDI-3160
 URL: https://issues.apache.org/jira/browse/HUDI-3160
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Manoj Govindassamy
Assignee: Manoj Govindassamy








[jira] [Assigned] (HUDI-3142) Metadata new Indices initialization during table creation

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy reassigned HUDI-3142:


Assignee: Manoj Govindassamy

> Metadata new Indices initialization during table creation 
> --
>
> Key: HUDI-3142
> URL: https://issues.apache.org/jira/browse/HUDI-3142
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Critical
> Fix For: 0.11.0
>
>
> When the metadata table is created for the first time, it checks whether index 
> initialization is needed by comparing with the data table timeline. Today the 
> initialization only takes care of the metadata files partition. We need to do 
> similar initialization for all the new index partitions - bloom_filters, 
> col_stats.
>  
>  





[jira] [Updated] (HUDI-2740) Support for snapshot querying on MOR table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2740:
--
Priority: Major  (was: Blocker)

> Support for snapshot querying on MOR table
> --
>
> Key: HUDI-2740
> URL: https://issues.apache.org/jira/browse/HUDI-2740
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-2694) Support for ORC format

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2694:
--
Priority: Major  (was: Blocker)

> Support for ORC format
> --
>
> Key: HUDI-2694
> URL: https://issues.apache.org/jira/browse/HUDI-2694
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-2735) Fix archival of commits in Java client for Kafka Connect

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2735:
-
Status: In Progress  (was: Open)

> Fix archival of commits in Java client for Kafka Connect
> 
>
> Key: HUDI-2735
> URL: https://issues.apache.org/jira/browse/HUDI-2735
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3007) Address minor feedbacks on the repair utility

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3007:
-
Status: In Progress  (was: Open)

> Address minor feedbacks on the repair utility
> -
>
> Key: HUDI-3007
> URL: https://issues.apache.org/jira/browse/HUDI-3007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2693) Support for incremental queries on MOR table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2693:
--
Priority: Major  (was: Blocker)

> Support for incremental queries on MOR table
> 
>
> Key: HUDI-2693
> URL: https://issues.apache.org/jira/browse/HUDI-2693
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-2690) Support for read optimized queries on MOR table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2690:
--
Status: Patch Available  (was: In Progress)

> Support for read optimized queries on MOR table
> ---
>
> Key: HUDI-2690
> URL: https://issues.apache.org/jira/browse/HUDI-2690
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-2690) Support for read optimized queries on MOR table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2690:
--
Status: Resolved  (was: Patch Available)

> Support for read optimized queries on MOR table
> ---
>
> Key: HUDI-2690
> URL: https://issues.apache.org/jira/browse/HUDI-2690
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-2690) Support for read optimized queries on MOR table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2690:
--
Status: In Progress  (was: Open)

> Support for read optimized queries on MOR table
> ---
>
> Key: HUDI-2690
> URL: https://issues.apache.org/jira/browse/HUDI-2690
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-2692) Support for incremental queries on COW table

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2692:
--
Priority: Major  (was: Blocker)

> Support for incremental queries on COW table
> 
>
> Key: HUDI-2692
> URL: https://issues.apache.org/jira/browse/HUDI-2692
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>






[GitHub] [hudi] a0x closed issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

2022-01-03 Thread GitBox


a0x closed issue #4442:
URL: https://github.com/apache/hudi/issues/4442


   






[GitHub] [hudi] a0x edited a comment on issue #4442: [SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql

2022-01-03 Thread GitBox


a0x edited a comment on issue #4442:
URL: https://github.com/apache/hudi/issues/4442#issuecomment-1004486507


   > I have the same issue when running hudi on emr. This issue seems to have 
the same root cause as in this one: #4474 . The solution is to shade and 
relocate aws dependencies introduced in hudi-aws:
   > 
   > > For our internal hudi version, we shade aws dependencies, you can add 
new relocation and build a new bundle package:
   > > For example, to shade aws dependencies in spark, add following codes in 
**packaging/hudi-spark-bundle/pom.xml**
   > > ```
   > > <relocation>
   > >   <pattern>com.amazonaws.</pattern>
   > >   <shadedPattern>${spark.bundle.spark.shade.prefix}com.amazonaws.</shadedPattern>
   > > </relocation>
   > > ```
   > 
   > @xushiyan should this relocation be added to the official hudi release to 
avoid such conflicts?
   
   @kazdy Thank you! This should work.
   
   But shall we shade all aws deps in Spark? I'm worried about the side effects, but let me have a try before replying in #4474 






[jira] [Updated] (HUDI-2948) Hudi Clustering Performance

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2948:
--
Summary: Hudi Clustering Performance  (was: [UMBRELLA] Hudi Clustering 
Performance)

> Hudi Clustering Performance
> ---
>
> Key: HUDI-2948
> URL: https://issues.apache.org/jira/browse/HUDI-2948
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: performance
> Fix For: 0.11.0
>
>
> This is an umbrella task for the effort of improving Hudi's Clustering 
> performance.





[jira] [Updated] (HUDI-2948) [UMBRELLA] Hudi Clustering Performance

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2948:
--
Issue Type: Test  (was: Epic)

> [UMBRELLA] Hudi Clustering Performance
> --
>
> Key: HUDI-2948
> URL: https://issues.apache.org/jira/browse/HUDI-2948
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: performance
> Fix For: 0.11.0
>
>
> This is an umbrella task for the effort of improving Hudi's Clustering 
> performance.





[jira] [Updated] (HUDI-2948) [UMBRELLA] Hudi Clustering Performance

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2948:
--
Epic Link: HUDI-1042

> [UMBRELLA] Hudi Clustering Performance
> --
>
> Key: HUDI-2948
> URL: https://issues.apache.org/jira/browse/HUDI-2948
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: performance
> Fix For: 0.11.0
>
>
> This is an umbrella task for the effort of improving Hudi's Clustering 
> performance.





[jira] [Assigned] (HUDI-2711) Fallback to full table scan for IncrementalRelation and HoodieIncrSource when data file is missing.

2022-01-03 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2711:


Assignee: Jagmeet Bali

> Fallback to full table scan for IncrementalRelation and HoodieIncrSource when 
> data file is missing.
> ---
>
> Key: HUDI-2711
> URL: https://issues.apache.org/jira/browse/HUDI-2711
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jagmeet Bali
>Assignee: Jagmeet Bali
>Priority: Minor
>  Labels: pull-request-available, query-eng, sev:critical
> Fix For: 0.11.0, 0.10.1
>
>
> Fall back to a full table scan for incremental readers if the underlying file 
> has been moved or deleted by the cleaner.
> For more info https://github.com/apache/hudi/issues/2934





[jira] [Updated] (HUDI-3126) Address whackamoles during testing of Hudi Trino connector

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3126:

Status: Resolved  (was: Patch Available)

> Address whackamoles during testing of Hudi Trino connector
> --
>
> Key: HUDI-3126
> URL: https://issues.apache.org/jira/browse/HUDI-3126
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2950) Address high small objects churn in Bulk Insert/Layout Optimization

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2950:
--
Epic Link: HUDI-1042  (was: HUDI-2948)

> Address high small objects churn in Bulk Insert/Layout Optimization
> ---
>
> Key: HUDI-2950
> URL: https://issues.apache.org/jira/browse/HUDI-2950
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Based on findings in HUDI-2949, the following needs to be addressed to reduce 
> pressure on GC and improve performance: 
>  * Remove unnecessary `ArrayList` resizing (during Hilbert Curve mapping)
>  * Avoid unnecessary boxing (during Hilbert Curve mapping)
>  * (In Parquet) Avoid allocating `ByteBuffer`s in the `compareTo` method invoked 
> from the `BinaryStatistics.updateStats` method (on every write to Parquet's 
> `ColumnWriterBase`)
>  * Avoid the {{bytesToAvro}} / {{avroToBytes}} ser-de loop (due to use of 
> {{OverwriteWithLatestAvroPayload}}, to be replaced w/ 
> {{RewriteAvroPayload}})
>  * Avoid re-allocating substrings (caching them) when fetching 
> {{Path.getName}} (from {{HoodieWrapperFileSystem.getBytesWritten}})
>  * Avoid allocating large deques in {{DefaultSizeEstimator.sizeEstimate}} 
> (currently allocates a 16 x 1024 default internal `ArrayDeque`)





[jira] [Updated] (HUDI-2711) Fallback to full table scan for IncrementalRelation and HoodieIncrSource when data file is missing.

2022-01-03 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2711:
-
Reviewers: sivabalan narayanan

> Fallback to full table scan for IncrementalRelation and HoodieIncrSource when 
> data file is missing.
> ---
>
> Key: HUDI-2711
> URL: https://issues.apache.org/jira/browse/HUDI-2711
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jagmeet Bali
>Priority: Minor
>  Labels: pull-request-available, query-eng, sev:critical
> Fix For: 0.11.0, 0.10.1
>
>
> Fall back to a full table scan for incremental readers if the underlying file 
> has been moved or deleted by the cleaner.
> For more info https://github.com/apache/hudi/issues/2934





[jira] [Updated] (HUDI-2949) Benchmark Clustering performance

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2949:
--
Epic Link: HUDI-1042  (was: HUDI-2948)

> Benchmark Clustering performance
> 
>
> Key: HUDI-2949
> URL: https://issues.apache.org/jira/browse/HUDI-2949
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
>
> These observations are from running Layout Optimization (Clustering) on a 
> [small Amazon 
> Reviews|https://s3.amazonaws.com/amazon-reviews-pds/readme.html] (4.5Gb, 
> reduced) dataset
> h2. *Major*
>  * GC is taking up to *25%* of CPU cycles (a lot of churn)
>  ** A lot of ArrayList resizing like in the code below
> {code:java}
> // Creating an empty list, then immediately inserting
> List values = new ArrayList<>();
> values.addAll(JavaConverters.bufferAsJavaListConverter(row.toSeq().toBuffer()).asJava());
> values.add(hilbertValue);
> {code}
>  ** A lot of avoidable boxing like the following
> {code:java}
> // Collecting as Longs, then unboxing into longs
> List longList = fieldMap.entrySet().stream().map(...)
> byte[] hilbertValue = HilbertCurveUtils.indexBytes(
>     hilbertCurve, longList.stream().mapToLong(l -> l).toArray(), 63);
> {code}
>  * Up to *20%* of wall-clock time is spent under locks in 
> BoundedInMemoryExecutor (LinkedBlockingQueue)
>  * ~35% of wall-clock time is spent Gzip-ing the output
> h2. *Per stage*
>  
> The experiment can roughly be broken down into the following stages:
>  * {_}Bulk-insert{_}: of the raw data into the Hudi table
>  * {_}Sorting{_}: re-sorting data according to the Layout Optimization config 
> (reshuffling)
>  * {_}Bulk-insert (of the sorted){_}: bulk inserting the reshuffled data
>  
> h4. Bulk Insert
> {_}Memory Allocated Total{_}: 22,000 samples x 500kb (sampling frequency) ~= 
> 11Gb
> {_}GC{_}: 6%
>  
> _Observations_
>  * *~30%* of CPU is spent on Gzip compression
>  ** Created HUDI-2928 to flip the default from gzip to zstd
> h4. Sorting
> _Memory Allocated Total:_ 36,000 samples x 500kb (sampling frequency) ~= 
> *18Gb*
> {_}GC{_}: ~6%
>  
> _Observations (Memory)_
>  * About *16%* is allocated by 
> {{BinaryStatistics.updateStats}} in Parquet's {{ColumnWriterBase}}
>  ** Writing to a Parquet column as a whole allocates *~19%*, i.e. the 
> actual write allocates only 3% and *80% of it is overhead*
>  ** Allocating {{HeapByteBuffer}} in {{Binary.toByteBuffer}} w/in 
> {{PrimitiveComparator}} (!!!) accounting min/max values for columns
>  ** Created PARQUET-2106 / 
> [PR#940|https://github.com/apache/parquet-mr/pull/940]
>  * About *18%* is spent on {{bytesToAvro}} / {{avroToBytes}} conversion in 
> calls to
>  ** {{OverwriteWithLatestAvroPayload.getInsertValue}}
>  ** {{OverwriteWithLatestAvroPayload.}}
>  * About 4% is allocated by fetching {{Path.getName}} in 
> {{HoodieWrapperFileSystem.getBytesWritten}}
>  ** Internally Hadoop calls {{path.substring}}, allocating a new string every 
> time
>  * About *5%* of memory is allocated by {{DefaultSizeEstimator.sizeEstimate}}
>  ** ~3% is in the ctor – the instance allocates by default: 
> {{private final Deque pending = new ArrayDeque<>(16 * 1024);}}
>  ** The remaining 2% is allocated while traversing the object tree
>  *** Resizing hash-sets
>  *** Fetching methods/fields through reflection (allocates arrays)
>  
> _Observations (CPU)_
>  * About 30% of time is spent in a waiting state under 
> locks w/in {{LinkedBlockingQueue}} in {{BoundedInMemoryQueue}}
>  * About 10% is spent on parsing Spark's {{Row}} in 
> {{HoodieSparkUtils.createRdd}}
>  * About 2% of the CPU wall time is spent on parsing Avro schemas
>  
> h4. Bulk-insert (sorted)
> Memory Allocated (Total): 45,000 samples x 500kb ~= *22Gb*
> GC: *~23%*
>  
> Observations are similar to [unordered 
> bulk-insert|https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-5880?block=block-3cfa6bf5-23bd-4e21-8a56-48fcb198b244]
>  
> h2. Profiling
> All profiles for these benchmarks have been taken using 
> [async-profiler|https://github.com/jvm-profiling-tools/async-profiler].
>  
> {code:java}
> # CPU
> PID=48449;EVENT=itimer;TS=$(date +%s); ./profiler.sh collect -e $EVENT -d 
> 60 -f "profile_${PID}${EVENT}${TS}.html" $PID
> # Memory
> PID=;EVENT=alloc;TS=$(date +%s); ./profiler.sh collect -e $EVENT -d 60 
> -f "profile_${PID}${EVENT}${TS}.html" --alloc 500k $PID
> {code}
>  





[jira] [Updated] (HUDI-2723) Add product integration tests

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2723:
--
Status: Resolved  (was: Patch Available)

> Add product integration tests
> -
>
> Key: HUDI-2723
> URL: https://issues.apache.org/jira/browse/HUDI-2723
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2694) Support for ORC format

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2694:

Fix Version/s: (was: 0.11.0)

> Support for ORC format
> --
>
> Key: HUDI-2694
> URL: https://issues.apache.org/jira/browse/HUDI-2694
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3082:
-
Sprint: Hudi-Sprint-Jan-3

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> This is Phase 1 of what is outlined in HUDI-3081.
>  
> The goal is to:
>  * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, 
> {{RealtimeUnmergedRecordReader}})
>  ** _These Readers should only differ in the way they handle the payload; 
> everything else should remain constant_
>  * Abstract w/in a common component (name TBD):
>  ** Listing the current file-slices at the requested instant (handling the 
> timeline)
>  ** Creating a Record Iterator for the provided file-slice





[jira] [Updated] (HUDI-2299) The log format DELETE block lose the info orderingVal

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2299:
-
Sprint: Hudi-Sprint-Jan-3

> The log format DELETE block lose the info orderingVal
> -
>
> Key: HUDI-2299
> URL: https://issues.apache.org/jira/browse/HUDI-2299
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> The append handle now always writes the data block first and then the delete 
> block, and the delete block only keeps the hoodie keys. When reading, the scanner 
> just reads the DELETE block without any info about the ordering value. Thus, if we 
> write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message ends up deleted! This is a critical bug for 
> streaming write; we should fix it as soon as possible.
> _*Here is the discussion on slack*_:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug for MOR log format, if there are 
> disorder DELETEs in the streaming messages, the event time of the DELETEs are 
> totally ignored.
> 12:44
> I guess this should be a blocker of 0.9 because it affect the correctness of 
> the data set.
> vc  12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I try to fix it but can not figure out a good way, if the DELETE block 
> records the orderingVal, the format breaks the compatibility.
> vc  1:05 PM
> We can version the format. That's doable. Should we precombine before even 
> logging the deletes?
> Danny Chan  1:11 PM
> Yes, we should
> vc  1:26 PM
> I think, thats how its working today. Deletes don't have an ordering val per 
> se, right
> 1:28
> Delete block at t1 :
>   delete key k
> Data block at t2 :
>   ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For what kind of functionality you need, we need to do soft deletes i.e 
> updates with an ordering value instead of hard deletes
> 1:28
> makes sense?
> Danny Chan  1:32 PM
> we can but that’s not the perfect solution, especially if the dataset comes 
> from a CDC source, for example the MySQL binlog. There is no extra flag in 
> schema for soft delete though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE; even if we do a 
> soft DELETE, the event time (orderingVal) is still important for consumers 
> for versioning. (edited) 
> vc  1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> lets do a 0.9.1 after this ?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan  1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc  1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event 
> time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion you can just define a new 
> version for delete block alone for e,g
> 2:00
> and add more information
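
For what it's worth, the fix being discussed amounts to comparing ordering values at merge time rather than letting a delete always win by arrival order; a minimal, hypothetical sketch (not Hudi's log-format code):

{code:java}
public class EventTimeMergeSketch {

  static class Op {
    final String key;
    final long orderingVal; // event time, e.g. the ts field
    final boolean isDelete;

    Op(String key, long orderingVal, boolean isDelete) {
      this.key = key;
      this.orderingVal = orderingVal;
      this.isDelete = isDelete;
    }
  }

  // The delete only wins when its ordering value is >= the insert's.
  static Op merge(Op insert, Op delete) {
    return delete.orderingVal >= insert.orderingVal ? delete : insert;
  }

  public static void main(String[] args) {
    Op insert = new Op("0", 2L, false); // insert: {id: 0, ts: 2}
    Op delete = new Op("0", 1L, true);  // delete: {id: 0, ts: 1}
    System.out.println(merge(insert, delete).isDelete); // false -> the insert survives
  }
}
{code}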





[jira] [Updated] (HUDI-431) Support Parquet in MOR log files

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-431:

Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-3  (was: Hudi-Sprint-Jan-3)

> Support Parquet in MOR log files
> 
>
> Key: HUDI-431
> URL: https://issues.apache.org/jira/browse/HUDI-431
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We have a basic implementation of an inline filesystem, to read a file format 
> like Parquet embedded "inline" into another file. See 
> [https://github.com/apache/hudi/blob/master/hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystem.java]
>  for sample usage.
> The idea here is to see if we can embed parquet/hfile formats into the Hudi 
> log files, to get columnar reads on the delta log files as well. This helps 
> us speed up query performance, given the log is row-based today. Once the inline 
> FS is available, enable parquet logging support with HoodieLogFile. LogFile 
> can expose a writer (essentially a ParquetWriter) and users can write records 
> as though writing to parquet files. Similarly, on the read path, a reader 
> (parquetReader) will be exposed which the user can use to read data out of 
> it. 
> This Jira tracks work to implement such parquet inlining into the log format 
> and have the writer and reader use it. 
>  





[jira] [Updated] (HUDI-2693) Support for incremental queries on MOR table

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2693:

Fix Version/s: (was: 0.11.0)

> Support for incremental queries on MOR table
> 
>
> Key: HUDI-2693
> URL: https://issues.apache.org/jira/browse/HUDI-2693
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-2740) Support for snapshot querying on MOR table

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2740:

Fix Version/s: (was: 0.11.0)

> Support for snapshot querying on MOR table
> --
>
> Key: HUDI-2740
> URL: https://issues.apache.org/jira/browse/HUDI-2740
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>






[jira] [Updated] (HUDI-2692) Support for incremental queries on COW table

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2692:

Fix Version/s: (was: 0.11.0)

> Support for incremental queries on COW table
> 
>
> Key: HUDI-2692
> URL: https://issues.apache.org/jira/browse/HUDI-2692
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2690) Support for read optimized queries on MOR table

2022-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2690:

Fix Version/s: (was: 0.11.0)

> Support for read optimized queries on MOR table
> ---
>
> Key: HUDI-2690
> URL: https://issues.apache.org/jira/browse/HUDI-2690
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3076) Docs for config file

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3076:
-
Sprint: Hudi-Sprint-Jan-3

> Docs for config file
> 
>
> Key: HUDI-3076
> URL: https://issues.apache.org/jira/browse/HUDI-3076
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2950) Address high small objects churn in Bulk Insert/Layout Optimization

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2950:
-
Sprint: Hudi-Sprint-Jan-3

> Address high small objects churn in Bulk Insert/Layout Optimization
> ---
>
> Key: HUDI-2950
> URL: https://issues.apache.org/jira/browse/HUDI-2950
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Based on findings in HUDI-2949, following needs to be addressed to reduce 
> pressure on GC, and improve performance: 
>  * Remove unnecessary `ArrayList` resizing (during Hilbert Curve mapping)
>  * Avoid unnecessary boxing (during Hilbert Curve mapping)
>  * (In Parquet) Avoid allocating `ByteBuffer`s in `compareTo` method invoked 
> from `BinaryStatistics.updateStats` method (on every write to Parquet's 
> `ColumnWriterBase`)
>  * Avoid {{bytesToAvro}} / {{avroToBytes}} ser-de loop (due to use of 
> {{{}OverwriteWithLatestAvroPayload{}}}, to be replaced w/ 
> {{{}RewriteAvroPayload{}}})
>  * Avoid re-allocating substrings (caching them) when fetching 
> {{Path.getName}} (from {{HoodieWrapperFileSystem.getBytesWritten}})
>  * Avoid allocating large deques in {{DefaultSizeEstimator.sizeEstimate}} 
> (currently allocates a 16 x 1024 default internal `ArrayDeque`)
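For illustration only (not Hudi code), the first two items above amount to patterns like this:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch: pre-size collections and keep arithmetic on primitives so hot
// loops do not churn small objects (resized internal arrays, boxed Longs).
public class ChurnSketch {
  static List<Long> prefixSums(long[] dims) {
    List<Long> out = new ArrayList<>(dims.length); // pre-sized: no internal resizing
    long acc = 0L;                                 // primitive accumulator: no boxing per iteration
    for (long d : dims) {
      acc += d;
      out.add(acc); // boxing happens once per element at the boundary only
    }
    return out;
  }
}
{code}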



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2735) Fix archival of commits in Java client for Kafka Connect

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2735:
-
Sprint: Hudi-Sprint-Jan-3

> Fix archival of commits in Java client for Kafka Connect
> 
>
> Key: HUDI-2735
> URL: https://issues.apache.org/jira/browse/HUDI-2735
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3075) Docs for Debezium source

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3075:
-
Sprint: Hudi-Sprint-Jan-3

> Docs for Debezium source
> 
>
> Key: HUDI-3075
> URL: https://issues.apache.org/jira/browse/HUDI-3075
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3074) Docs for Z-order

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3074:
-
Sprint: Hudi-Sprint-Jan-3

> Docs for Z-order
> 
>
> Key: HUDI-3074
> URL: https://issues.apache.org/jira/browse/HUDI-3074
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2948) [UMBRELLA] Hudi Clustering Performance

2022-01-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2948:
--
Epic Name: Table Services - Clustering  (was: Clustering Performance)

> [UMBRELLA] Hudi Clustering Performance
> --
>
> Key: HUDI-2948
> URL: https://issues.apache.org/jira/browse/HUDI-2948
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: performance
> Fix For: 0.11.0
>
>
> This is an umbrella task for the effort of improving Hudi's Clustering 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3080) Docs for CloudWatch

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3080:
-
Sprint: Hudi-Sprint-Jan-3

> Docs for CloudWatch
> ---
>
> Key: HUDI-3080
> URL: https://issues.apache.org/jira/browse/HUDI-3080
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2965) Fix layout optimization to appropriately handle nested columns references

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2965:
-
Sprint: Hudi-Sprint-Jan-3

> Fix layout optimization to appropriately handle nested columns references
> -
>
> Key: HUDI-2965
> URL: https://issues.apache.org/jira/browse/HUDI-2965
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, layout optimization only works for top-level columns specified as 
> the columns to order by.
>  
> We need to make sure it works correctly when a nested field 
> reference is specified in the configuration as well (like "a.b.c", 
> referencing the field `c` w/in the `b` sub-object of the top-level `a` column)
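A minimal sketch of resolving such a dotted reference against an Avro schema (illustrative, not the actual Hudi resolver; real code would also unwrap nullable union schemas):

{code:java}
import org.apache.avro.Schema;

// Walk "a.b.c" down a nested record schema, returning the schema of the
// innermost referenced field.
public class NestedFieldResolver {
  static Schema resolve(Schema root, String dottedRef) {
    Schema current = root;
    for (String part : dottedRef.split("\\.")) {
      Schema.Field field = current.getField(part);
      if (field == null) {
        throw new IllegalArgumentException(
            "No field '" + part + "' in schema " + current.getFullName());
      }
      current = field.schema(); // note: nullable fields are unions; unwrap in real code
    }
    return current;
  }
}
{code}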



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2693) Support for incremental queries on MOR table

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2693:
-
Reviewers: Vinoth Chandar

> Support for incremental queries on MOR table
> 
>
> Key: HUDI-2693
> URL: https://issues.apache.org/jira/browse/HUDI-2693
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2752) The MOR DELETE block breaks the event time sequence of CDC

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2752:
-
Sprint: Hudi-Sprint-Jan-3

> The MOR DELETE block breaks the event time sequence of CDC
> --
>
> Key: HUDI-2752
> URL: https://issues.apache.org/jira/browse/HUDI-2752
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, the DELETE blocks are always written after the data blocks for one 
> batch of data writes; when there are INSERT/UPDATEs after the DELETE, the data 
> would be lost.
> What I can think of is that the DELETE block should at least keep the event 
> time sequence for #preCombine with other record payloads.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3004) Support bootstrap file splits

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3004:
-
Sprint: Hudi-Sprint-Jan-3

> Support bootstrap file splits
> -
>
> Key: HUDI-3004
> URL: https://issues.apache.org/jira/browse/HUDI-3004
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2688) RFC

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2688:
-
Sprint: Hudi-Sprint-Jan-3

> RFC
> ---
>
> Key: HUDI-2688
> URL: https://issues.apache.org/jira/browse/HUDI-2688
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3127) Add a new HoodieHFileReader for Trino with Java 11

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3127:
-
Sprint: Hudi-Sprint-Jan-3

> Add a new HoodieHFileReader for Trino with Java 11
> --
>
> Key: HUDI-3127
> URL: https://issues.apache.org/jira/browse/HUDI-3127
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> To address compatibility issues of HBase 1.2.3 with Java 11 for Hudi Trino 
> connector
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2785) Create Trino setup in docker demo

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2785:
-
Sprint: Hudi-Sprint-Jan-3

> Create Trino setup in docker demo
> -
>
> Key: HUDI-2785
> URL: https://issues.apache.org/jira/browse/HUDI-2785
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2723) Add product integration tests

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2723:
-
Sprint: Hudi-Sprint-Jan-3

> Add product integration tests
> -
>
> Key: HUDI-2723
> URL: https://issues.apache.org/jira/browse/HUDI-2723
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2584) Unit tests for bloom filter index based out of metadata table.

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-2584:
-
Story Points: 3  (was: 8)

> Unit tests for bloom filter index based out of metadata table. 
> ---
>
> Key: HUDI-2584
> URL: https://issues.apache.org/jira/browse/HUDI-2584
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.11.0
>
>
> Test the bloom filter index backed by the metadata table.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2689) Support for snapshot query on COW table

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2689:
-
Sprint: Hudi-Sprint-Jan-3

> Support for snapshot query on COW table
> ---
>
> Key: HUDI-2689
> URL: https://issues.apache.org/jira/browse/HUDI-2689
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1295) Implement: Metadata based bloom index - write path

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-1295:
-
Story Points: 5  (was: 20)

> Implement: Metadata based bloom index - write path
> --
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> The idea here is to maintain our bloom filters outside of parquet files for 
> speedier access during bloom index lookups.
>  
> - Design and implement bloom filter migration to the metadata table. 
> Design:
> schema for the payload: 
> key: partitionName_fileName
> payload schema:
> isDeleted (boolean): true/false
> bloom_type: short
> ser_bloom: byte[] representing serialized bloom filter. 
>  
>  
>  
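In Java terms, the payload sketched above would carry roughly these fields (field names follow the ticket; this is not Hudi's actual record class):

{code:java}
// Sketch of the metadata-table bloom filter payload described above.
public class BloomFilterMetadataPayloadSketch {
  String key;        // "<partitionName>_<fileName>"
  boolean isDeleted; // true if this bloom filter entry has been deleted
  short bloomType;   // identifies the serialized bloom filter implementation
  byte[] serBloom;   // the serialized bloom filter bytes
}
{code}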



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2518) Implement stats/range tracking as a part of Metadata table

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-2518:
-
Story Points: 5  (was: 20)

> Implement stats/range tracking as a part of Metadata table
> --
>
> Key: HUDI-2518
> URL: https://issues.apache.org/jira/browse/HUDI-2518
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2883) Refactor Hive Sync tool /config to use reflection and move to hudi sync common package

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2883:


Assignee: Vinoth Chandar  (was: Rajesh Mahindra)

> Refactor Hive Sync tool /config to use reflection and move to hudi sync 
> common package
> --
>
> Key: HUDI-2883
> URL: https://issues.apache.org/jira/browse/HUDI-2883
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2762:


Assignee: Alexey Kudinkin  (was: Sagar Sumit)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2762:
-
Epic Link: HUDI-2749

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2762:
-
Epic Link:   (was: HUDI-2519)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2762:
-
Priority: Blocker  (was: Major)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2762:
-
Priority: Major  (was: Blocker)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3155) java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3155:
-
Priority: Major  (was: Blocker)

> java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool
> --
>
> Key: HUDI-3155
> URL: https://issues.apache.org/jira/browse/HUDI-3155
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: sev:critical
> Fix For: 0.11.0
>
>
> https://github.com/apache/hudi/issues/4176
> Looks like parquet-column is not part of the bundle



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2762:
-
Component/s: Hive Integration

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Rajesh Mahindra
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3012) Investigate: Metadata table write performance impact

2022-01-03 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy closed HUDI-3012.

Resolution: Cannot Reproduce

Closing the issue as the degradation is not seen with Spark data source based 
table writes.

> Investigate: Metadata table write performance impact
> 
>
> Key: HUDI-3012
> URL: https://issues.apache.org/jira/browse/HUDI-3012
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> # Write path: Run Hoodie table inserts/upserts via Spark DataSource or 
> DeltaStreamer and investigate the performance impact
>  # (optional) Read path: Measure the boost on the read side by using the 
> metadata table based file listings. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3155) java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3155:
-
Labels: sev:critical  (was: )

> java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool
> --
>
> Key: HUDI-3155
> URL: https://issues.apache.org/jira/browse/HUDI-3155
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: sev:critical
> Fix For: 0.11.0
>
>
> https://github.com/apache/hudi/issues/4176
> Looks like parquet-column is not part of the bundle



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3012) Investigate: Metadata table write performance impact

2022-01-03 Thread Manoj Govindassamy (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468352#comment-17468352
 ] 

Manoj Govindassamy commented on HUDI-3012:
--

 

Ran the Spark data source based Hudi table upserts with and without the 
metadata table, and I don't see the performance degradation. The previous test 
was using the integ test suite and the delta streamer; the integ test suite 
brought in many moving parts for the write and did a bunch of FS-based file 
listings on its own. Closing the issue since I don't see the major performance 
degradation anymore. 

> Investigate: Metadata table write performance impact
> 
>
> Key: HUDI-3012
> URL: https://issues.apache.org/jira/browse/HUDI-3012
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> # Write path: Run Hoodie table inserts/upserts via Spark DataSource or 
> DeltaStreamer and investigate the performance impact
>  # (optional) Read path: Measure the boost on the read side by using the 
> metadata table based file listings. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-52) Implement Savepoints for Merge On Read table #88

2022-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-52:

Story Points: 2  (was: 1)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, help-requested, sev:high, starter
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3010) Enable metadata file listing for Presto directory lister

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-3010:


Assignee: Vinoth Chandar  (was: Sagar Sumit)

> Enable metadata file listing for Presto directory lister
> 
>
> Key: HUDI-3010
> URL: https://issues.apache.org/jira/browse/HUDI-3010
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Presto Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Let's get the PR landed, with numbers showing that queries stabilize after a 
> while (w/o metadata): 
> https://github.com/prestodb/presto/pull/17084



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3010) Enable metadata file listing for Presto directory lister

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3010:
-
Sprint: Hudi-Sprint-Jan-3

> Enable metadata file listing for Presto directory lister
> 
>
> Key: HUDI-3010
> URL: https://issues.apache.org/jira/browse/HUDI-3010
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Presto Integration
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Let's get the PR landed, with numbers showing that queries stabilize after a 
> while (w/o metadata): 
> https://github.com/prestodb/presto/pull/17084



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2774) Async Clustering via deltastreamer fails with IllegalStateException: Duplicate key [==>20211116123724586__replacecommit__INFLIGHT]

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2774:
-
Sprint: Hudi-Sprint-Jan-3

> Async Clustering via deltastreamer fails with IllegalStateException: Duplicate 
> key [==>20211116123724586__replacecommit__INFLIGHT]
> -
>
> Key: HUDI-2774
> URL: https://issues.apache.org/jira/browse/HUDI-2774
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:high
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-11-16 at 12.42.20 PM.png
>
>
> Setup:
> Started deltastreamer with a parquet DFS source. The source folder did not 
> have any data as such. Enabled async clustering with the below props:
> ```
> hoodie.clustering.async.max.commits=2
> hoodie.clustering.plan.strategy.sort.columns=type,id
> ```
> Added 1 file to the source folder, and deltastreamer failed during this. The 
> commit went through fine; looks like the 1st replace commit also went through 
> fine, but deltastreamer failed. I need to understand why deltastreamer tries 
> to schedule a 2nd replace commit as well. It runs in continuous mode, goes 
> into the next round immediately, and there is no more data to sync. 
> Note: there is only one partition and one file group in the entire dataset. 
>  
> The clustering plan seems to be the same in both replace commit requested 
> meta files:
> {code:java}
> ^@&<93>c%^Z<81>9%-^KA^B^G^B^NCLUSTER^B^B^B^B^B^B^Afile:/tmp/hudi-deltastreamer-gh-mw/PushEvent/2542ddef-0169-4978-9b1b-84977d6141cf-0_0-49-161_20211116130523827.parquet^B^@^BL2542ddef-0169-4978-9b1b-84977d6141cf-0^B^RPushEvent^B^@^@^B^@^B
> ^^TOTAL_LOG_FILES^@^@^@^@^@^@^@^@^VTOTAL_IO_MB^@^@^@^@^@^@^@^@ 
> TOTAL_IO_READ_MB^@^@^@^@^@^@^@^@(TOTAL_LOG_FILES_SIZE^@^@^@^@^@^@^@^@"TOTAL_IO_WRITE_MB^@^@^@^@^@^@^@^@^@^@^B^@^B^@^B^B^Aorg.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy^B^BXhoodie.clustering.plan.strategy.sort.columns^Ntype,id^@^@^B^B^@^@^B^B^A^B^@^@^B&<93>c%^Z<81>9%-^KA{code}
>  
> {code:java}
> ^@^L%b3<85><
> <89>^B^G^B^NCLUSTER^B^B^B^B^B^B^Afile:/tmp/hudi-deltastreamer-gh-mw/PushEvent/2542ddef-0169-4978-9b1b-84977d6141cf-0_0-49-161_20211116130523827.parquet^B^@^BL2542ddef-0169-4978-9b1b-84977d6141cf-0^B^RPushEvent^B^@^@^B^@^B
> ^^TOTAL_LOG_FILES^@^@^@^@^@^@^@^@^VTOTAL_IO_MB^@^@^@^@^@^@^@^@ 
> TOTAL_IO_READ_MB^@^@^@^@^@^@^@^@(TOTAL_LOG_FILES_SIZE^@^@^@^@^@^@^@^@"TOTAL_IO_WRITE_MB^@^@^@^@^@^@^@^@^@^@^B^@^B^@^B^B^Aorg.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy^B^BXhoodie.clustering.plan.strategy.sort.columns^Ntype,id^@^@^B^B^@^@^B^B^A^B^@^@^B^L%b3<85><
> <89> {code}
>  
> timeline
> !Screen Shot 2021-11-16 at 12.42.20 PM.png!
>  
> stacktrace:
> {code:java}
> 21/11/16 13:05:20 WARN HoodieDeltaStreamer: Next round 
> 21/11/16 13:05:20 WARN DeltaSync: Extra metadata :: 20211116130512915, 
> 20211116130512915.commit, = [schema, deltastreamer.checkpoint.key]
> 21/11/16 13:05:23 WARN HoodieDeltaStreamer: Starting async clustering service 
> if required 111 
> 21/11/16 13:05:27 WARN HoodieDeltaStreamer: Scheduled async clustering for 
> instant: 20211116130526895
> 21/11/16 13:05:27 WARN HoodieDeltaStreamer: Next round 
> 21/11/16 13:05:27 WARN DeltaSync: Extra metadata :: 20211116130523827, 
> 20211116130523827.commit, = [schema, deltastreamer.checkpoint.key]
> 21/11/16 13:05:27 WARN HoodieDeltaStreamer: Scheduled async clustering for 
> instant: 20211116130527394
> 21/11/16 13:05:27 WARN HoodieDeltaStreamer: Next round 
> 21/11/16 13:05:27 WARN DeltaSync: Extra metadata :: 20211116130523827, 
> 20211116130523827.commit, = [schema, deltastreamer.checkpoint.key]
> 21/11/16 13:05:28 ERROR Executor: Exception in task 0.0 in stage 74.0 (TID 
> 176)
> java.lang.IllegalStateException: Duplicate key 
> [==>20211116130526895__replacecommit__INFLIGHT]
>   at 
> java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
>   at java.util.HashMap.merge(HashMap.java:1254)
>   at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
>   at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
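The IllegalStateException above is the stock behavior of Collectors.toMap, which throws on duplicate keys. A self-contained reproduction, plus the usual remedy of supplying a merge function (illustrative only; not necessarily the fix Hudi chose):

{code:java}
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DuplicateKeyDemo {
  public static void main(String[] args) {
    // Two pending clustering entries with the same inflight instant key.
    String[][] entries = {
        {"20211116130526895", "planA"},
        {"20211116130526895", "planA"}};

    // Without a merge function this throws:
    //   java.lang.IllegalStateException: Duplicate key ...
    // Passing (a, b) -> a keeps the first entry instead.
    Map<String, String> plans = Stream.of(entries)
        .collect(Collectors.toMap(e -> e[0], e -> e[1], (a, b) -> a));
    System.out.println(plans); // {20211116130526895=planA}
  }
}
{code}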

[jira] [Updated] (HUDI-3158) Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-03 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3158:
-
Description: 
{code:java}
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]{code}
To reduce the repeated warn logs

 

  was:
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]
22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file for 
instant [==>20220103192919722__replacecommit__REQUESTED]


> Reduce warn logs in Spark SQL INSERT OVERWRITE
> --
>
> Key: HUDI-3158
> URL: https://issues.apache.org/jira/browse/HUDI-3158
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Major
>  Labels: sev:normal
>
> {code:java}
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]{code}
> To reduce the repeated warn logs
>  
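One straightforward way to cut the repetition, sketched here as a log-once guard (an assumption about the approach, not the committed fix):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Warn only the first time a given instant is seen with empty requested content.
public class WarnOnceSketch {
  private static final Logger LOG = LoggerFactory.getLogger(WarnOnceSketch.class);
  private static final Set<String> ALREADY_WARNED = ConcurrentHashMap.newKeySet();

  static void warnEmptyInstant(String instant) {
    if (ALREADY_WARNED.add(instant)) { // add() returns false for repeats
      LOG.warn("No content found in requested file for instant {}", instant);
    }
  }
}
{code}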



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-1180) Upgrade HBase to 2.x

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1180:


Assignee: Vinoth Chandar  (was: Alexey Kudinkin)

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Trying to upgrade HBase to 2.3.3 but ran into several issues.
> According to the Hadoop version support matrix 
> ([http://hbase.apache.org/book.html#hadoop]), we also need to upgrade Hadoop 
> to 2.8.5+.
>  
> There are several API conflicts between HBase 2.2.3 and HBase 1.2.3, we need 
> to resolve this first. After resolving conflicts, I am able to compile it but 
> then I ran into a tricky jetty version issue during the testing:
> {code:java}
> [ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 
> s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
> elapsed: 0.076 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 
> 0.16 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> 34206 [Thread-260] WARN  
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
> shutdown has been called
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
> IncrementalBlockReportManager interrupted
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
> for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
> cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
> 34246 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 34247 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  
> - Cannot locate configuration: tried 
> hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
> 43904 
> [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
> WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
> null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   

[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2432:
--
Story Points: 2  (was: 1)

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>  
> Trying to see if we really need a plan. Dumping my thoughts here. 
> Restore internally converts to N rollbacks. We fetch active instants in 
> reverse order from the timeline and trigger rollbacks one by one. We already 
> have a patch fixing rollback to add a rollback plan in the rollback.requested 
> meta file. So, walking through failure scenarios: 
>  
> With restore, individual rollbacks are not published to the timeline. So, if 
> restore fails midway, then on the 2nd attempt only a subset of the rollbacks 
> (those executed during the 2nd attempt) will be applied to the metadata 
> table. So we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested 
> can't be skipped and we have to publish it to the timeline. So, here is what 
> will happen w/o a restore plan:
>  
> start restore
>     rollback commit N
>           rollback.requested for commit N // plan
>           execute rollback, but do not publish to timeline, so this will not 
> get applied to the metadata table. 
>     rollback commit N-1
>            rollback.requested for commit N-1 // plan
>           execute rollback, but do not publish to timeline; again, will not 
> get applied to the metadata table. 
>      .
> commit restore and publish. This will get applied to the metadata table. 
> Once we are done committing the restore, we can remove all rollback.requested 
> files if needed. 
>  
> Failure scenarios: 
> If we fail after 2 rollbacks, 
> on re-attempt we will process the remaining commits only, since the active 
> timeline may not report commit N and commit N-1 as active. So, we can do 
> something like below w/ a restore plan:
>  
> 1. start restore
>    2. schedule rollbacks for all of them: 
>         serialize all commit instants that need to be rolled back along with 
> the rollback plan. // by now, we would have created a rollback.requested meta 
> file for all commits that need to be rolled back. 
>     3. now execute the rollbacks one by one. // do not publish to timeline 
> once done; changes should also not be applied to the metadata table. 
> 4. collect rollback commit metadata from all individual rollbacks and create 
> the restore commit metadata. There could be some commits which were already 
> rolled back, and for those we need to manually create rollback metadata based 
> on the rollback plan (more details in the next para). Commit the restore and 
> publish; only this will get applied to the metadata table (which in turn will 
> unwrap the individual rollback metadata and apply it to the metadata table). 
>  
> Failures:
> If we fail after the 2nd rollback:
> on the 2nd attempt, we will look at the restore plan for all commits that 
> need to be rolled back. We can't really roll back the first 2 since they are 
> already rolled back, so we will manually create rollback metadata from the 
> rollback.requested meta file; for the rest, we will follow the regular flow 
> of executing the actual rollback and collecting rollback metadata. Once 
> complete, we will serialize all this info in the restore metadata, which gets 
> applied to the metadata table. 
>  
> Alternatives: since restore is anyway a destructive operation and users are 
> advised to stop all processes, we do have the option to clean up the metadata 
> table and re-bootstrap it completely once restore is complete. 
>  
>  
>  
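The flow above, condensed into a sketch (all method names are illustrative, not Hudi's actual client API):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Two-phase restore: persist the full plan first, execute rollbacks without
// publishing them, and apply everything to the metadata table via a single
// restore commit so that failed attempts can be resumed safely.
abstract class RestorePlanSketch {
  abstract List<String> instantsAfter(String savepoint);            // from the active timeline
  abstract void writeRequestedRestorePlan(String savepoint, List<String> instants);
  abstract String executeRollbackWithoutPublishing(String instant); // returns rollback metadata
  abstract void commitRestore(String savepoint, List<String> rollbackMetadata);

  final void restoreTo(String savepoint) {
    // Phase 1: plan. The requested restore file records every instant to roll back.
    List<String> toRollback = instantsAfter(savepoint);
    writeRequestedRestorePlan(savepoint, toRollback);

    // Phase 2: execute. Each rollback stays unpublished, so the metadata table
    // sees nothing until the final restore commit.
    List<String> rollbackMetadata = new ArrayList<>();
    for (String instant : toRollback) {
      rollbackMetadata.add(executeRollbackWithoutPublishing(instant));
    }

    // Single publish: only this restore commit is applied to the metadata table.
    commitRestore(savepoint, rollbackMetadata);
  }
}
{code}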



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2432:
--
Story Points: 1  (was: 2)

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>  
> Trying to see if we really need a plan. Dumping my thoughts here. 
> Restore internally converts to N rollbacks. We fetch active instants in 
> reverse order from the timeline and trigger rollbacks one by one. We already 
> have a patch fixing rollback to add a rollback plan in the rollback.requested 
> meta file. So, walking through failure scenarios: 
>  
> With restore, individual rollbacks are not published to the timeline. So, if 
> restore fails midway, then on the 2nd attempt only a subset of the rollbacks 
> (those executed during the 2nd attempt) will be applied to the metadata 
> table. So we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested 
> can't be skipped and we have to publish it to the timeline. So, here is what 
> will happen w/o a restore plan:
>  
> start restore
>     rollback commit N
>           rollback.requested for commit N // plan
>           execute rollback, but do not publish to timeline, so this will not 
> get applied to the metadata table. 
>     rollback commit N-1
>            rollback.requested for commit N-1 // plan
>           execute rollback, but do not publish to timeline; again, will not 
> get applied to the metadata table. 
>      .
> commit restore and publish. This will get applied to the metadata table. 
> Once we are done committing the restore, we can remove all rollback.requested 
> files if needed. 
>  
> Failure scenarios: 
> If we fail after 2 rollbacks, 
> on re-attempt we will process the remaining commits only, since the active 
> timeline may not report commit N and commit N-1 as active. So, we can do 
> something like below w/ a restore plan:
>  
> 1. start restore
>    2. schedule rollbacks for all of them: 
>         serialize all commit instants that need to be rolled back along with 
> the rollback plan. // by now, we would have created a rollback.requested meta 
> file for all commits that need to be rolled back. 
>     3. now execute the rollbacks one by one. // do not publish to timeline 
> once done; changes should also not be applied to the metadata table. 
> 4. collect rollback commit metadata from all individual rollbacks and create 
> the restore commit metadata. There could be some commits which were already 
> rolled back, and for those we need to manually create rollback metadata based 
> on the rollback plan (more details in the next para). Commit the restore and 
> publish; only this will get applied to the metadata table (which in turn will 
> unwrap the individual rollback metadata and apply it to the metadata table). 
>  
> Failures:
> If we fail after the 2nd rollback:
> on the 2nd attempt, we will look at the restore plan for all commits that 
> need to be rolled back. We can't really roll back the first 2 since they are 
> already rolled back, so we will manually create rollback metadata from the 
> rollback.requested meta file; for the rest, we will follow the regular flow 
> of executing the actual rollback and collecting rollback metadata. Once 
> complete, we will serialize all this info in the restore metadata, which gets 
> applied to the metadata table. 
>  
> Alternatives: since restore is anyway a destructive operation and users are 
> advised to stop all processes, we do have the option to clean up the metadata 
> table and re-bootstrap it completely once restore is complete. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2432:
--
Story Points: 1  (was: 5)

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>  
> Trying to see if we really need a plan. Dumping my thoughts here. 
> Restore internally converts to N rollbacks. We fetch active instants in 
> reverse order from the timeline and trigger rollbacks one by one. We already 
> have a patch fixing rollback to add a rollback plan in the rollback.requested 
> meta file. So, walking through failure scenarios: 
>  
> With restore, individual rollbacks are not published to the timeline. So, if 
> restore fails midway, then on the 2nd attempt only a subset of the rollbacks 
> (those executed during the 2nd attempt) will be applied to the metadata 
> table. So we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested 
> can't be skipped and we have to publish it to the timeline. So, here is what 
> will happen w/o a restore plan:
>  
> start restore
>     rollback commit N
>           rollback.requested for commit N // plan
>           execute rollback, but do not publish to timeline, so this will not 
> get applied to the metadata table. 
>     rollback commit N-1
>            rollback.requested for commit N-1 // plan
>           execute rollback, but do not publish to timeline; again, will not 
> get applied to the metadata table. 
>      .
> commit restore and publish. This will get applied to the metadata table. 
> Once we are done committing the restore, we can remove all rollback.requested 
> files if needed. 
>  
> Failure scenarios: 
> If we fail after 2 rollbacks, 
> on re-attempt we will process the remaining commits only, since the active 
> timeline may not report commit N and commit N-1 as active. So, we can do 
> something like below w/ a restore plan:
>  
> 1. start restore
>    2. schedule rollbacks for all of them: 
>         serialize all commit instants that need to be rolled back along with 
> the rollback plan. // by now, we would have created a rollback.requested meta 
> file for all commits that need to be rolled back. 
>     3. now execute the rollbacks one by one. // do not publish to timeline 
> once done; changes should also not be applied to the metadata table. 
> 4. collect rollback commit metadata from all individual rollbacks and create 
> the restore commit metadata. There could be some commits which were already 
> rolled back, and for those we need to manually create rollback metadata based 
> on the rollback plan (more details in the next para). Commit the restore and 
> publish; only this will get applied to the metadata table (which in turn will 
> unwrap the individual rollback metadata and apply it to the metadata table). 
>  
> Failures:
> If we fail after the 2nd rollback:
> on the 2nd attempt, we will look at the restore plan for all commits that 
> need to be rolled back. We can't really roll back the first 2 since they are 
> already rolled back, so we will manually create rollback metadata from the 
> rollback.requested meta file; for the rest, we will follow the regular flow 
> of executing the actual rollback and collecting rollback metadata. Once 
> complete, we will serialize all this info in the restore metadata, which gets 
> applied to the metadata table. 
>  
> Alternatives: since restore is anyway a destructive operation and users are 
> advised to stop all processes, we do have the option to clean up the metadata 
> table and re-bootstrap it completely once restore is complete. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2432:


Assignee: sivabalan narayanan  (was: Manoj Govindassamy)

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>  
> Trying to see if we really need a plan. Dumping my thoughts here. 
> Restore internally converts to N rollbacks. We fetch active instants in 
> reverse order from the timeline and trigger rollbacks one by one. We already 
> have a patch fixing rollback to add a rollback plan in the rollback.requested 
> meta file. So, walking through failure scenarios: 
>  
> With restore, individual rollbacks are not published to the timeline. So, if 
> restore fails midway, then on the 2nd attempt only a subset of the rollbacks 
> (those executed during the 2nd attempt) will be applied to the metadata 
> table. So we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested 
> can't be skipped and we have to publish it to the timeline. So, here is what 
> will happen w/o a restore plan:
>  
> start restore
>     rollback commit N
>           rollback.requested for commit N // plan
>           execute rollback, but do not publish to timeline, so this will not 
> get applied to the metadata table. 
>     rollback commit N-1
>            rollback.requested for commit N-1 // plan
>           execute rollback, but do not publish to timeline; again, will not 
> get applied to the metadata table. 
>      .
> commit restore and publish. This will get applied to the metadata table. 
> Once we are done committing the restore, we can remove all rollback.requested 
> files if needed. 
>  
> Failure scenarios: 
> If we fail after 2 rollbacks, 
> on re-attempt we will process the remaining commits only, since the active 
> timeline may not report commit N and commit N-1 as active. So, we can do 
> something like below w/ a restore plan:
>  
> 1. start restore
>    2. schedule rollbacks for all of them: 
>         serialize all commit instants that need to be rolled back along with 
> the rollback plan. // by now, we would have created a rollback.requested meta 
> file for all commits that need to be rolled back. 
>     3. now execute the rollbacks one by one. // do not publish to timeline 
> once done; changes should also not be applied to the metadata table. 
> 4. collect rollback commit metadata from all individual rollbacks and create 
> the restore commit metadata. There could be some commits which were already 
> rolled back, and for those we need to manually create rollback metadata based 
> on the rollback plan (more details in the next para). Commit the restore and 
> publish; only this will get applied to the metadata table (which in turn will 
> unwrap the individual rollback metadata and apply it to the metadata table). 
>  
> Failures:
> If we fail after the 2nd rollback:
> on the 2nd attempt, we will look at the restore plan for all commits that 
> need to be rolled back. We can't really roll back the first 2 since they are 
> already rolled back, so we will manually create rollback metadata from the 
> rollback.requested meta file; for the rest, we will follow the regular flow 
> of executing the actual rollback and collecting rollback metadata. Once 
> complete, we will serialize all this info in the restore metadata, which gets 
> applied to the metadata table. 
>  
> Alternatives: since restore is anyway a destructive operation and users are 
> advised to stop all processes, we do have the option to clean up the metadata 
> table and re-bootstrap it completely once restore is complete. 
>  
>  
>  
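A minimal sketch of the plan-based restore flow above, again with hypothetical
types rather than Hudi's own classes: all rollbacks are scheduled (and their
plans serialized) up front, each is executed without publishing, and only the
single restore commit carrying all the collected rollback metadata is applied
to the metadata table.

    import java.util.ArrayList;
    import java.util.List;

    public class RestoreFlowSketch {

      // Hypothetical stand-ins, not actual Hudi classes.
      record RollbackPlan(String commitToRollback) {}
      record RollbackMetadata(String commit, List<String> deletedFiles) {}
      record RestoreMetadata(List<RollbackMetadata> rollbacks) {}

      // Step 2: schedule rollbacks for every commit up front; writing a
      // rollback.requested meta file per commit serializes the plan.
      List<RollbackPlan> scheduleAllRollbacks(List<String> commitsToRollback) {
        List<RollbackPlan> plans = new ArrayList<>();
        for (String commit : commitsToRollback) {
          plans.add(new RollbackPlan(commit));
        }
        return plans;
      }

      // Steps 3-4: execute each rollback without publishing it, then fold
      // all the collected rollback metadata into one restore commit.
      RestoreMetadata executeRestore(List<RollbackPlan> plans) {
        List<RollbackMetadata> collected = new ArrayList<>();
        for (RollbackPlan plan : plans) {
          // Execute the rollback; do NOT publish a completed rollback
          // instant and do NOT apply the change to the metadata table yet.
          collected.add(executeRollbackWithoutPublishing(plan));
        }
        // Only this single restore commit is applied to the metadata table,
        // which in turn unwraps the individual rollback metadata.
        return new RestoreMetadata(collected);
      }

      private RollbackMetadata executeRollbackWithoutPublishing(RollbackPlan plan) {
        // Placeholder for the actual file-level rollback work.
        return new RollbackMetadata(plan.commitToRollback(), List.of());
      }
    }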



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2432:
-
Sprint: Hudi-Sprint-Jan-3

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2477) Restore fails after adding rollback plan and rollback.requested instant w/ metadata enabled

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2477:
-
Sprint: Hudi-Sprint-Jan-3

> Restore fails after adding rollback plan and rollback.requested instant w/ 
> metadata enabled
> ---
>
> Key: HUDI-2477
> URL: https://issues.apache.org/jira/browse/HUDI-2477
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Restore triggers rollbacks of N commits and then finally commits the
> restore. None of the rollbacks are published to the timeline.
> But after we added the rollback.requested instant, restore is breaking
> with metadata enabled.
> Here is what is happening:
> Restore
>      schedule rollbacks for all N commits. This publishes
> rollback.requested instants to the timeline. Remember we can't skip this
> publishing, because the rollback action executor depends on it.
>     trigger the rollback action executor, which executes the rollback. But
> this time we do not publish the rollbacks, so there won't be rollback
> completed instants.
> Now, to finalize the restore, we apply the changes to the metadata table
> before we can commit the restore to the data table. Here is where the issue
> is. We check whether bootstrapping is required. Chances are the last instant
> synced to the metadata table is no longer active in the data table, so a
> bootstrap is triggered. But we allow bootstrap only if there are no pending
> operations in the data table, and all the rollbacks surface as pending
> operations, hence we fail here (see the sketch after this description).
>
> This could also be an issue when we play with bootstrap on the original
> dataset: bootstrap, and then for some reason you want to roll back the
> bootstrap. This might end up in the same state too.
> To illustrate clearly:
> bootstrap
>     also apply changes to metadata. There is only one commit.
> rollback bootstrap
>    this is a restore operation, so we first do a rollback, which creates a
> rollback.requested instant.
>                to finalize the restore, we try to apply the restore to
> metadata.
>                     this goes into the bootstrap code path. The last synced
> instant is not found in the data timeline, we assume it is archived, and so
> we trigger a re-bootstrap and delete the metadata table.
>                     we then try to do the actual bootstrap, but since there
> is a pending operation in the data timeline (rollback.requested), we skip
> the bootstrap. So that state remains, i.e. the metadata table stays deleted,
> and actually applying the restore commit will fail.
>
>
> We also need to think about whether, even if metadata is enabled, we should
> leave the rollback instants in the timeline, or clean them up after
> committing the restore to the timeline.
>
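A minimal sketch of the failing guard described above, with hypothetical
Instant/State types (not Hudi's own): mid-restore, the rollback.requested
instants are still pending, so a naive "no pending operations" check disables
the metadata-table bootstrap. The second method sketches one possible
direction, purely as an assumption, of excluding the restore's own scheduled
rollbacks from the check.

    import java.util.List;

    public class MetadataBootstrapGuardSketch {

      enum State { REQUESTED, INFLIGHT, COMPLETED }
      record Instant(String action, State state) {}

      // Naive guard: any non-completed instant disables bootstrap, which is
      // exactly what the restore's rollback.requested instants trip over.
      static boolean naiveBootstrapAllowed(List<Instant> timeline) {
        return timeline.stream().allMatch(i -> i.state() == State.COMPLETED);
      }

      // Hypothetical fix direction: ignore rollbacks scheduled by the
      // restore itself when deciding bootstrap eligibility.
      static boolean restoreAwareBootstrapAllowed(List<Instant> timeline) {
        return timeline.stream()
            .filter(i -> !"rollback".equals(i.action()))
            .allMatch(i -> i.state() == State.COMPLETED);
      }

      public static void main(String[] args) {
        List<Instant> timeline = List.of(
            new Instant("commit", State.COMPLETED),
            new Instant("rollback", State.REQUESTED)); // left mid-restore
        System.out.println(naiveBootstrapAllowed(timeline));        // false
        System.out.println(restoreAwareBootstrapAllowed(timeline)); // true
      }
    }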



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2477) Restore fails after adding rollback plan and rollback.requested instant w/ metadata enabled

2022-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2477:
--
Story Points: 1  (was: 5)

> Restore fails after adding rollback plan and rollback.requested instant w/ 
> metadata enabled
> ---
>
> Key: HUDI-2477
> URL: https://issues.apache.org/jira/browse/HUDI-2477
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2477) Restore fails after adding rollback plan and rollback.requested instant w/ metadata enabled

2022-01-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2477:


Assignee: sivabalan narayanan  (was: Manoj Govindassamy)

> Restore fails after adding rollback plan and rollback.requested instant w/ 
> metadata enabled
> ---
>
> Key: HUDI-2477
> URL: https://issues.apache.org/jira/browse/HUDI-2477
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

