[GitHub] [hudi] xushiyan commented on a diff in pull request #6827: [HUDI-4950] Fix read log lead to oom not be catched issue

2022-10-01 Thread GitBox


xushiyan commented on code in PR #6827:
URL: https://github.com/apache/hudi/pull/6827#discussion_r985062923


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java:
##
@@ -557,8 +557,9 @@ public static Schema getLatestTableSchema(String path, 
org.apache.hadoop.conf.Co
 try {
   HoodieTableMetaClient metaClient = StreamerUtil.createMetaClient(path, 
hadoopConf);
   return getTableAvroSchema(metaClient, false);
-} catch (Exception e) {
-  LOG.warn("Error while resolving the latest table schema", e);
+} catch (Throwable throwable) {
+  LOG.warn("Error while resolving the latest table schema.", throwable);
+  // ignored
 }

Review Comment:
   @TJX2014 then `getLatestTableSchema` should be fixed too. We don't catch 
Throwable because Error should not be caught. Quoting the javadoc:
   
   > An Error is a subclass of Throwable that indicates serious problems that a 
reasonable application should not try to catch.
   
   So what is the strong reason to catch and ignore errors like OOM? You'd need 
to fail loudly in that case.
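
   For reference, a small self-contained sketch of the distinction (illustrative only, 
not the proposed patch): catch `Exception` for recoverable failures, but let `Error`s 
such as `OutOfMemoryError` propagate so the job fails loudly.
   
   ```java
   public class SchemaResolutionExample {
   
     // Hypothetical stand-in for schema resolution; it may fail with an exception.
     static String resolveLatestSchema() throws Exception {
       throw new Exception("schema not found");
     }
   
     // Catch Exception (recoverable) but let Errors such as OutOfMemoryError
     // propagate, so serious problems fail the job loudly instead of being ignored.
     static String tryResolveLatestSchema() {
       try {
         return resolveLatestSchema();
       } catch (Exception e) {
         System.err.println("Error while resolving the latest table schema: " + e);
         return null;
       }
     }
   
     public static void main(String[] args) {
       System.out.println(tryResolveLatestSchema()); // prints null; an Error would have crashed
     }
   }
   ```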



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4716) Avoid bundle parquet in hadoop-mr

2022-10-01 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4716:
-
Sprint: 2022/10/03

> Avoid bundle parquet in hadoop-mr
> -
>
> Key: HUDI-4716
> URL: https://issues.apache.org/jira/browse/HUDI-4716
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.0
>
>
> As per discussion in 
> https://github.com/apache/hudi/pull/5250#discussion_r930144788
> This will reduce the bundle size and uphold the principle of not bundling 
> file storage format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6840: [HUDI-4916] Implement change log feed for Flink

2022-10-01 Thread GitBox


hudi-bot commented on PR #6840:
URL: https://github.com/apache/hudi/pull/6840#issuecomment-1264289570

   
   ## CI report:
   
   * 91ef1a0ff644c2a9bf855a409cb6f46fcaeb7817 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11949)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6846: [HUDI-4962] Move cloud dependencies to cloud modules

2022-10-01 Thread GitBox


hudi-bot commented on PR #6846:
URL: https://github.com/apache/hudi/pull/6846#issuecomment-1264293012

   
   ## CI report:
   
   * 6137f878fb8c1bf572e53cc56624341a39893c31 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11950)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on pull request #6745: Fix comment in RFC46

2022-10-01 Thread GitBox


wzx140 commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1264301226

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-01 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1264306911

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-01 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1264306863

   
   ## CI report:
   
   * b7a71b7de263c5117e199df45cba0ead191159a3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11938)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11952)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6847: [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion

2022-10-01 Thread GitBox


hudi-bot commented on PR #6847:
URL: https://github.com/apache/hudi/pull/6847#issuecomment-1264306927

   
   ## CI report:
   
   * 125682ead07b967f1a63cbf36e39cffcf906d5c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11951)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on issue #6832: [SUPPORT] AWS Glue 3.0 fail to write dataset with hudi (hive sync issue)

2022-10-01 Thread GitBox


kazdy commented on issue #6832:
URL: https://github.com/apache/hudi/issues/6832#issuecomment-1264315897

   hi @dragonH 
   
   I had the same issue. Basically, the Glue Data Catalog converts table names and 
column names to lowercase. The Glue client for the Hive metastore (the AWS one) does 
not take this into account: `Customer_Sample_Hudi` becomes `customer_sample_hudi`, and 
since the Glue client is case sensitive it throws this error:
   `java.lang.IllegalArgumentException: Partitions must be in the same table`
   
   Here's exactly the piece of code that causes this issue:
   
   `checkArgument(tbl.getDbName().equals(partition.getDbName()), "Partitions 
must be in the same DB");`
   
   
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/ac318d42b3df00c1ada1be6a3305bcf9bd4895f0/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/metastore/GlueMetastoreClientDelegate.java#L679
   
   You can try using AWSGlueCatalogSyncClient; it might be fixed there. 
   If you don't want to change the sync tool, just change this:
   `'hoodie.datasource.hive_sync.table': 'Customer_Sample_Hudi'`
   to
   `'hoodie.datasource.hive_sync.table': 'customer_sample_hudi'`
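
   For completeness, a minimal Java sketch of the same workaround (the helper and the 
DataFrame are illustrative; the option keys are the ones used above): lowercase the 
table name before handing it to hive sync so it matches what the Glue Data Catalog stores.
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class GlueSyncNameExample {
   
     // Hypothetical helper: lowercases the table name up front, matching how the
     // Glue Data Catalog stores it, so the case-sensitive Glue client does not fail.
     static void writeWithGlueFriendlyName(Dataset<Row> df, String tableName, String basePath) {
       df.write()
           .format("hudi")
           .option("hoodie.table.name", tableName.toLowerCase())
           .option("hoodie.datasource.hive_sync.enable", "true")
           .option("hoodie.datasource.hive_sync.table", tableName.toLowerCase())
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```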
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 merged pull request #6840: [HUDI-4916] Implement change log feed for Flink

2022-10-01 Thread GitBox


danny0405 merged PR #6840:
URL: https://github.com/apache/hudi/pull/6840


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-4916] Implement change log feed for Flink (#6840)

2022-10-01 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f3d4ce919d [HUDI-4916] Implement change log feed for Flink (#6840)
f3d4ce919d is described below

commit f3d4ce919d4909f9533255ee2a9a0450c8e44c73
Author: Danny Chan 
AuthorDate: Sat Oct 1 18:21:23 2022 +0800

[HUDI-4916] Implement change log feed for Flink (#6840)
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   5 +
 .../java/org/apache/hudi/io/HoodieCDCLogger.java   |   2 +-
 .../hudi/io/HoodieMergeHandleWithChangeLog.java|   8 +-
 .../apache/hudi/client/HoodieFlinkWriteClient.java |  50 +-
 .../java/org/apache/hudi/io/FlinkAppendHandle.java |   6 -
 .../FlinkMergeAndReplaceHandleWithChangeLog.java}  |  51 +-
 .../hudi/io/FlinkMergeHandleWithChangeLog.java}|  52 +-
 .../apache/hudi/io/FlinkWriteHandleFactory.java| 283 
 .../java/org/apache/hudi/io/MiniBatchHandle.java   |   8 -
 .../hudi/common/model/WriteOperationType.java  |  14 +
 .../hudi/common/table/HoodieTableConfig.java   |   4 +-
 .../hudi/common/table/cdc/HoodieCDCExtractor.java  | 100 ++-
 .../hudi/common/table/cdc/HoodieCDCFileSplit.java  |  31 +-
 .../cdc/HoodieCDCSupplementalLoggingMode.java  |   4 +-
 .../table/log/HoodieCDCLogRecordIterator.java  |  14 +-
 .../table/log/HoodieUnMergedLogRecordScanner.java  |   6 +-
 .../apache/hudi/configuration/FlinkOptions.java|  22 +
 .../apache/hudi/configuration/OptionsResolver.java |   9 +
 .../apache/hudi/source/IncrementalInputSplits.java | 114 +++-
 .../hudi/source/StreamReadMonitoringFunction.java  |  11 +-
 .../org/apache/hudi/table/HoodieTableSource.java   |  34 +-
 .../org/apache/hudi/table/format/FormatUtils.java  |  20 +
 .../hudi/table/format/cdc/CdcInputFormat.java  | 723 +
 .../hudi/table/format/cdc/CdcInputSplit.java   |  61 ++
 .../table/format/mor/MergeOnReadInputFormat.java   | 166 +++--
 .../table/format/mor/MergeOnReadInputSplit.java|   2 +-
 .../org/apache/hudi/util/RowDataProjection.java|   1 +
 .../java/org/apache/hudi/util/StreamerUtil.java|   4 +
 .../apache/hudi/table/ITTestHoodieDataSource.java  |  36 +
 .../apache/hudi/table/format/TestInputFormat.java  |  95 ++-
 .../test/java/org/apache/hudi/utils/TestData.java  |  30 +
 .../scala/org/apache/hudi/cdc/CDCRelation.scala|  15 +-
 .../scala/org/apache/hudi/cdc/HoodieCDCRDD.scala   |  32 +-
 33 files changed, 1732 insertions(+), 281 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index fd7972a719..3b3995c8f8 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -975,6 +975,11 @@ public class HoodieWriteConfig extends HoodieConfig {
 return getString(KEYGENERATOR_CLASS_NAME);
   }
 
+  public boolean isCDCEnabled() {
+return getBooleanOrDefault(
+HoodieTableConfig.CDC_ENABLED, 
HoodieTableConfig.CDC_ENABLED.defaultValue());
+  }
+
   public boolean isConsistentLogicalTimestampEnabled() {
 return 
getBooleanOrDefault(KeyGeneratorOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED);
   }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java
index f57b195c76..303eea76db 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java
@@ -95,7 +95,7 @@ public class HoodieCDCLogger implements Closeable {
   ? HoodieRecord.RECORD_KEY_METADATA_FIELD
   : tableConfig.getRecordKeyFieldProp();
   this.cdcWriter = cdcWriter;
-  this.cdcSupplementalLoggingMode = config.getCDCSupplementalLoggingMode();
+  this.cdcSupplementalLoggingMode = 
tableConfig.cdcSupplementalLoggingMode();
   this.cdcSchema = HoodieCDCUtils.schemaBySupplementalLoggingMode(
   cdcSupplementalLoggingMode,
   dataSchema
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
index 12e48ffbb4..910bc42158 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
@@ -21,6 +21,7 @@ package org.apache.hudi.io;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.

[jira] [Commented] (HUDI-4916) Implement change log feed for Flink

2022-10-01 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611859#comment-17611859
 ] 

Danny Chen commented on HUDI-4916:
--

Fixed via master branch: f3d4ce919d4909f9533255ee2a9a0450c8e44c73

> Implement change log feed for Flink
> ---
>
> Key: HUDI-4916
> URL: https://issues.apache.org/jira/browse/HUDI-4916
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4916) Implement change log feed for Flink

2022-10-01 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4916.
--

> Implement change log feed for Flink
> ---
>
> Key: HUDI-4916
> URL: https://issues.apache.org/jira/browse/HUDI-4916
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4769) Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4769:
-
Labels: pull-request-available  (was: )

> Option read.streaming.skip_compaction skips delta commit
> 
>
> Key: HUDI-4769
> URL: https://issues.apache.org/jira/browse/HUDI-4769
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, flink-sql
>Reporter: nonggia.liang
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Option read.streaming.skip_compaction was introduced to avoid consuming 
> duplicate data from delta-commits and compactions in a MOR table.
> But the option may cause delta-commits to be skipped; here is the case:
> Suppose we have a timeline (d for delta-commit, C for compaction/commit):
> d1 --> d2 --> C3 --> d4 --> d5 -->
> Let's say scans for streaming read happen at times t1 and t2, when d1 and d5 
> are the latest instants respectively. 
> When we scan at t2 with read.streaming.skip_compaction=true, we get the latest 
> merged file slice with only log files containing d4+d5.  So d2 is skipped.
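
To make the failure mode concrete, a toy sketch of the scenario above (illustrative only, not the actual Hudi scan logic): the scan at t2 only reads the log files of the latest merged file slice, so the delta commit d2 that was compacted into C3's base file is never emitted.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipCompactionToy {
  public static void main(String[] args) {
    // Toy model of the timeline above: C3 compacts d1 and d2 into a base file,
    // while d4 and d5 only exist as log files of the latest merged file slice.
    List<String> logFileCommits = Arrays.asList("d4", "d5");

    // The scan at t2 with read.streaming.skip_compaction=true reads only the
    // log files of the latest merged file slice, so it consumes d4 and d5.
    List<String> consumedAtT2 = new ArrayList<>(logFileCommits);

    // The streaming reader last stopped at d1, so it still needs d2, d4 and d5.
    List<String> expectedAtT2 = new ArrayList<>(Arrays.asList("d2", "d4", "d5"));

    expectedAtT2.removeAll(consumedAtT2);
    System.out.println("Delta commits lost by the scan: " + expectedAtT2); // [d2]
  }
}
{code}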



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 opened a new pull request, #6848: [HUDI-4769] Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread GitBox


danny0405 opened a new pull request, #6848:
URL: https://github.com/apache/hudi/pull/6848

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #6845: [HUDI-4945] Support to trigger the clean in the flink batch mode.

2022-10-01 Thread GitBox


danny0405 commented on code in PR #6845:
URL: https://github.com/apache/hudi/pull/6845#discussion_r985083279


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:
##
@@ -96,9 +96,10 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context 
context) {
   pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
   // compaction
   if (OptionsResolver.needsAsyncCompaction(conf)) {
-// use synchronous compaction for bounded source.
+// use synchronous compaction and clean for bounded source.
 if (context.isBounded()) {
   conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
+  conf.setBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED, false);
 }

Review Comment:
   We have another fix in #6515, does that solve your problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6848: [HUDI-4769] Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread GitBox


hudi-bot commented on PR #6848:
URL: https://github.com/apache/hudi/pull/6848#issuecomment-1264325892

   
   ## CI report:
   
   * 6318c29ab4e0415d365db365aac92f1f562a4b68 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6848: [HUDI-4769] Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread GitBox


hudi-bot commented on PR #6848:
URL: https://github.com/apache/hudi/pull/6848#issuecomment-1264327655

   
   ## CI report:
   
   * 6318c29ab4e0415d365db365aac92f1f562a4b68 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11954)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4964) Refactor HoodieWriteConfig to make sure we have a uniform and better way to access config value

2022-10-01 Thread Jian Feng (Jira)
Jian Feng created HUDI-4964:
---

 Summary: Refactor HoodieWriteConfig to make sure we have a uniform 
and better way to access config value
 Key: HUDI-4964
 URL: https://issues.apache.org/jira/browse/HUDI-4964
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Jian Feng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4964) Refactor HoodieWriteConfig to make sure we have a uniform and better way to access config value

2022-10-01 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng updated HUDI-4964:

Description: 
 I feel like this class is a hodgepodge and is ballooning. It does not 
demonstrate the benefits of configuration separation that a pluggable 
architecture should bring. 

if we keep wrap config.getString(ConfigKey) with a getter method such as 
config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
almost every configKey in every config(such as 
HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
 and etc) to HoodieWriteConfig

for the consistent manner and clean code purpose, I think we should inline all 
the getter methods that have no logic  

  was:
 I feel like this class is a hodgepodge and is ballooning. It does not 
demonstrate the benefits of configuration separation that a pluggable 
architecture should bring. 

if we keep wrap config.getString(ConfigKey) with a getter method such as 
config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
almost every configKey in every config(such as 
HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
 and etc) to HoodieWriteConfig

for the consistent manner and clean code purpose, I think we should inline all 
the getter methods that has no logic  


> Refactor HoodieWriteConfig to make sure we have a uniform and better way to 
> access config value
> ---
>
> Key: HUDI-4964
> URL: https://issues.apache.org/jira/browse/HUDI-4964
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jian Feng
>Priority: Major
>
>  I feel like this class is a hodgepodge and is ballooning. It does not 
> demonstrate the benefits of configuration separation that a pluggable 
> architecture should bring. 
> if we keep wrap config.getString(ConfigKey) with a getter method such as 
> config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
> almost every configKey in every config(such as 
> HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
>  and etc) to HoodieWriteConfig
> for the consistent manner and clean code purpose, I think we should inline 
> all the getter methods that have no logic  
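
For illustration, a self-contained toy sketch of the kind of inlining being proposed (toy class and config key, not the actual Hudi refactor):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class InlineGetterExample {

  // Toy stand-in for a config class: a generic key/value lookup with defaults.
  static class ToyConfig {
    private final Map<String, String> props = new HashMap<>();

    String getStringOrDefault(String key, String defaultValue) {
      return props.getOrDefault(key, defaultValue);
    }

    // "Before": a no-logic getter that only forwards to the generic lookup.
    // The proposal is to delete getters like this one.
    String getExampleOption() {
      return getStringOrDefault("example.option.key", "default-value");
    }
  }

  public static void main(String[] args) {
    ToyConfig config = new ToyConfig();

    // "After": the call site reads the property directly, so the config class
    // does not need one trivial getter per config key.
    String value = config.getStringOrDefault("example.option.key", "default-value");
    System.out.println(value);
  }
}
{code}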



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4964) Refactor HoodieWriteConfig to make sure we have a uniform and better way to access config value

2022-10-01 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng updated HUDI-4964:

Description: 
 I feel like this class is a hodgepodge and is ballooning. It does not 
demonstrate the benefits of configuration separation that a pluggable 
architecture should bring. 

if we keep wrap config.getString(ConfigKey) with a getter method such as 
config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
almost every configKey in every config(such as 
HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
 and etc) to HoodieWriteConfig

for the consistent manner and clean code purpose, I think we should inline all 
the getter methods that has no logic  

> Refactor HoodieWriteConfig to make sure we have a uniform and better way to 
> access config value
> ---
>
> Key: HUDI-4964
> URL: https://issues.apache.org/jira/browse/HUDI-4964
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jian Feng
>Priority: Major
>
>  I feel like this class is a hodgepodge and is ballooning. It does not 
> demonstrate the benefits of configuration separation that a pluggable 
> architecture should bring. 
> if we keep wrap config.getString(ConfigKey) with a getter method such as 
> config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
> almost every configKey in every config(such as 
> HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
>  and etc) to HoodieWriteConfig
> for the consistent manner and clean code purpose, I think we should inline 
> all the getter methods that has no logic  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] gtwuser opened a new issue, #6849: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.SchemaParseException: Can't redefine: element

2022-10-01 Thread GitBox


gtwuser opened a new issue, #6849:
URL: https://github.com/apache/hudi/issues/6849

   
   **Describe the problem you faced**
   
   If the same record is saved to a Hudi table via an upsert operation, it fails with 
`org.apache.avro.SchemaParseException: Can't redefine: element`. 
   On re-sending the same JSON record that was successfully saved earlier to the same 
table, Hudi throws back the error 
`org.apache.avro.SchemaParseException: Can't redefine: element`. 
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Please find the json file by unzipping it from the attached zip file 
(https://github.com/apache/hudi/files/9690116/b71f7f30-079f-43bf-8cc5-41da312db421_1661730367433716761_11.json.zip).
 Create a data frame from this file 
   ```python
   from pyspark.sql import functions as F
   from pyspark.sql.functions import lit

   df = (spark.read.option('dropFieldIfAllNull', True).option("mergeSchema", "true")
         .json(unzipped_json_file)
         .withColumn('year', F.year(F.current_timestamp()))
         .withColumn('month', F.month(F.current_timestamp()))
         .withColumn('day', F.dayofmonth(F.current_timestamp()))
         .withColumn('hour', F.hour(F.current_timestamp()))
         .withColumn('qt_id', lit('abcd')))
   ```
   Schema of the file is:
   ```
    |-- metadata: struct (nullable = true)
    |    |-- contentType: string (nullable = true)
    |    |-- myID: string (nullable = true)
    |    |-- mytopic: string (nullable = true)
    |-- payload: struct (nullable = true)
    |    |-- Details: struct (nullable = true)
    |    |    |-- MyInfo: struct (nullable = true)
    |    |    |    |-- myRoleModel: string (nullable = true)
    |    |    |-- variations: struct (nullable = true)
    |    |    |    |-- items: array (nullable = true)
    |    |    |    |    |-- element: struct (containsNull = true)
    |    |    |    |    |    |-- type: string (nullable = true)
    |    |    |    |    |    |-- veriety: string (nullable = true)
    |    |-- myAddress: struct (nullable = true)
    |    |    |-- address: array (nullable = true)
    |    |    |    |-- element: struct (containsNull = true)
    |    |    |    |    |-- blocks: struct (nullable = true)
    |    |    |    |    |    |-- quater: array (nullable = true)
    |    |    |    |    |    |    |-- element: struct (containsNull = true)
    |    |    |    |    |    |    |    |-- count: long (nullable = true)
    |    |    |    |    |    |    |    |-- max_occupancy: long (nullable = true)
    |    |    |    |    |    |    |    |-- querter_number: long (nullable = true)
    |    |    |    |    |    |    |    |-- size: long (nullable = true)
    |    |    |    |    |-- variations: struct (nullable = true)
    |    |    |    |    |    |-- items: array (nullable = true)
    |    |    |    |    |    |    |-- element: struct (containsNull = true)
    |    |    |    |    |    |    |    |-- type: string (nullable = true)
    |    |    |    |    |    |    |    |-- veriety: string (nullable = true)
    |    |-- quaterType: string (nullable = true)
    |    |-- quaterVeriant: string (nullable = true)
    |    |-- recordedAt: long (nullable = true)
    |-- veriant: string (nullable = true)
    |-- is_deleted: boolean (nullable = false)
   ```
   
   2. use below configs to write it to Hudi 
   ```bash
commonConfig = {
   'className': 'org.apache.hudi',
   'hoodie.datasource.hive_sync.use_jdbc': 'false',
   'hoodie.datasource.write.precombine.field': 
'payload.recordedAt',
   'hoodie.datasource.write.recordkey.field': 
'metadata.myID,metadata.mytopic',
   'hoodie.table.name': 'mydetails',
   'hoodie.datasource.hive_sync.database': 
args['database_name'],
   'hoodie.datasource.write.reconcile.schema': 'true',
   'hoodie.datasource.hive_sync.table': 
f'mydetails_{"_".join(prefix.split("/")[-7:-5])}'.lower(),
   'hoodie.datasource.hive_sync.enable': 'true',
   'path': 's3://' + args['curated_bucket'] + 
'/merged/mydetails/' + f'{prefix.split("/")[-7]}'.lower(),
   'hoodie.parquet.small.file.limit': '307200',
   'hoodie.parquet.max.file.size': '12800'
   }
   partitionDataConfig = {
   'hoodie.datasource.hive_sync.partition_extractor_class': 
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
   'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.CustomKeyGenerator',
   'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE, 
month:SIMPLE, day:SIMPLE, hour:SIMPLE, qt_id:SIMPLE',
   'hoodie.datasource.hive_sync.partition_fields': 'year, 
month, day, hour, device_id',
   'hoodie.datasource.write.hive_style_partitioning': 'true'
   }
   initLoadConfig = {'hoodie.bulkinsert.shuffle.parallelism': 68,

[GitHub] [hudi] gtwuser commented on issue #6849: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.SchemaParseException: Can't redefine: element

2022-10-01 Thread GitBox


gtwuser commented on issue #6849:
URL: https://github.com/apache/hudi/issues/6849#issuecomment-1264334459

   @nsivabalan @alexeykudinkin @yihua @xiarixiaoyao any help will be highly 
useful


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] fengjian428 opened a new pull request, #6850: [HUDI-4964] inline inline all the getter methods that have no logic …

2022-10-01 Thread GitBox


fengjian428 opened a new pull request, #6850:
URL: https://github.com/apache/hudi/pull/6850

   …in HoodieWriteConfig
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4964) Refactor HoodieWriteConfig to make sure we have a uniform and better way to access config value

2022-10-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4964:
-
Labels: pull-request-available  (was: )

> Refactor HoodieWriteConfig to make sure we have a uniform and better way to 
> access config value
> ---
>
> Key: HUDI-4964
> URL: https://issues.apache.org/jira/browse/HUDI-4964
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jian Feng
>Priority: Major
>  Labels: pull-request-available
>
>  I feel like this class is a hodgepodge and is ballooning. It does not 
> demonstrate the benefits of configuration separation that a pluggable 
> architecture should bring. 
> if we keep wrap config.getString(ConfigKey) with a getter method such as 
> config.get{$ConfigKey}() in HoodieWriteConfig, basically we need to include 
> almost every configKey in every config(such as 
> HoodieClusteringConfig,HoodieHBaseIndexConfig,HoodieIndexConfig,HoodieStorageConfig
>  and etc) to HoodieWriteConfig
> for the consistent manner and clean code purpose, I think we should inline 
> all the getter methods that have no logic  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] gtwuser commented on issue #6849: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.SchemaParseException: Can't redefine: element

2022-10-01 Thread GitBox


gtwuser commented on issue #6849:
URL: https://github.com/apache/hudi/issues/6849#issuecomment-1264337473

   This https://github.com/apache/hudi/issues/1326 may be similar to the above 
issue. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] gtwuser commented on issue #5891: How to read existing hoodie written data from `S3` using `AWS Glue DynamicFrame` class. Failing with error with below error: An error occurred while

2022-10-01 Thread GitBox


gtwuser commented on issue #5891:
URL: https://github.com/apache/hudi/issues/5891#issuecomment-1264338119

   thanks @jharringtonCoupons closing it for now. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] gtwuser closed issue #5891: How to read existing hoodie written data from `S3` using `AWS Glue DynamicFrame` class. Failing with error with below error: An error occurred while callin

2022-10-01 Thread GitBox


gtwuser closed issue #5891: How to read existing hoodie written data from `S3` 
using `AWS Glue DynamicFrame` class. Failing with error  with below error: An 
error occurred while calling o84.getDynamicFrame. 
s3:///.hoodie/202212312312.commit is not a Parquet file. expected magic 
number at tail 
URL: https://github.com/apache/hudi/issues/5891


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4952:
--
Fix Version/s: 0.12.2
   (was: 0.12.1)

> Reading from metadata table could fail when there are no completed commits
> --
>
> Key: HUDI-4952
> URL: https://issues.apache.org/jira/browse/HUDI-4952
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> When the metadata table is just getting initialized but the first commit is not 
> yet fully complete, reading from the metadata table could fail with the below 
> stacktrace. 
>  
> {code:java}
> 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job 
> 39d720db-b15d-4823-b8b1-54398b143d6e
> org.apache.hudi.exception.HoodieException: Error fetching partition paths 
> from metadata table
> at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315)
> at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176)
> at 
> org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219)
> at 
> org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264)
> at 
> org.apache.hudi.BaseHoodieTableFileIndex.(BaseHoodieTableFileIndex.java:139)
> at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:49)
> at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234)
> at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141)
> at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90)
> at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
> at 
> org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76)
> at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getSplits(HoodieCombineHiveInputFormat.java:942)
> at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getCombineSplits(HoodieCombineHiveInputFormat.java:241)
> at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getSplits(HoodieCombineHiveInputFormat.java:363)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267)
> at 
> org.apache.spark.api.java.JavaRDDLike$class.getNumPartitions(JavaRDDLike.scala:65)
> at 
> org.apache.spark.api.java.AbstractJavaRDDLike.getNumPartitions(JavaRDDLike.scala:45)
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:252)
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:179)
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:130)
> at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:355)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:400)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to 
> retrieve list of partition from metadata
> at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:113)
> at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313)
> ... 32 more
> Caused by: java.util.NoSuchElementException: No value present in Option
> at org.apache.hudi.common.util.Option.get(Option.java:89)
> at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionFileSlices(HoodieTableMetadataUtil.java:1057)
> at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(HoodieTableMetada

[jira] [Resolved] (HUDI-4951) An incorrect use of the Long method

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4951.
---

> An incorrect use of the Long method
> ---
>
> Key: HUDI-4951
> URL: https://issues.apache.org/jira/browse/HUDI-4951
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.11.1, 0.12.0
>Reporter: shenshengli
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> BaseHoodieWriteClient.java 
> {code:java}
> extraMeta.put(SerDeHelper.LATEST_SCHEMA, 
> SerDeHelper.toJson(newSchema.setSchemaId(Long.getLong(instantTime))));
> {code}
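
For context, a minimal sketch of why Long.getLong is the wrong call here (illustrative only, not the Hudi patch): Long.getLong(String) looks up a system property with that name, whereas Long.parseLong(String) parses the string itself.

{code:java}
public class LongMethodExample {
  public static void main(String[] args) {
    String instantTime = "20221001123045123";

    // Long.getLong treats the argument as a *system property name*,
    // so this returns null unless a property with that name is set.
    Long fromProperty = Long.getLong(instantTime);

    // Long.parseLong actually parses the numeric string.
    long parsed = Long.parseLong(instantTime);

    System.out.println(fromProperty); // null
    System.out.println(parsed);       // 20221001123045123
  }
}
{code}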



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4957) Shade JOL in every bundle

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4957.
---

> Shade JOL in every bundle
> -
>
> Key: HUDI-4957
> URL: https://issues.apache.org/jira/browse/HUDI-4957
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Bundles where it is not shaded
> hudi-aws-bundle
> hudi-datahub-sync-bundle
> hudi-gcp-bundle
> hudi-timeline-server-bundle
> hudi-utilities-slim-bundle



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4936) as.of.instant not recognized as hoodie config

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4936.
---

> as.of.instant not recognized as hoodie config
> -
>
> Key: HUDI-4936
> URL: https://issues.apache.org/jira/browse/HUDI-4936
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> original issue
> https://github.com/apache/hudi/issues/5586



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4922) Presto query of bootstrapped data returns null

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4922:
--
Fix Version/s: 0.12.2
   (was: 0.12.1)

>  Presto query of bootstrapped data returns null
> ---
>
> Key: HUDI-4922
> URL: https://issues.apache.org/jira/browse/HUDI-4922
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.12.2
>
>
> https://github.com/apache/hudi/issues/6532



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4924) Dedup parallelism is not auto tuned based on input

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4924.
---

> Dedup parallelism is not auto tuned based on input
> --
>
> Key: HUDI-4924
> URL: https://issues.apache.org/jira/browse/HUDI-4924
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> This causes some tests, such as 
> testToWriteWithoutParametersIncludedInHoodieTableConfig, to take longer to 
> finish than expected.  For this particular one, reducing the upsert shuffle 
> parallelism from the default of 200 to 2 makes the test finish in 15s instead 
> of 90s locally.
>  
> The actual root cause of the problem is that the dedup parallelism is taken 
> directly from the Hudi write shuffle parallelism, without auto-tuning.
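
A rough sketch of what auto-tuning could look like here (hypothetical helper, not the actual Hudi code): cap the dedup parallelism by the number of input partitions rather than using the configured shuffle parallelism as-is.

{code:java}
public final class DedupParallelismExample {

  private DedupParallelismExample() {
  }

  // Hypothetical auto-tuning rule: never use more partitions than the input has.
  static int autoTuneDedupParallelism(int configuredParallelism, int inputPartitions) {
    return Math.max(1, Math.min(configuredParallelism, inputPartitions));
  }

  public static void main(String[] args) {
    // A tiny test input with 2 partitions should not be shuffled into 200 partitions.
    System.out.println(autoTuneDedupParallelism(200, 2)); // 2
  }
}
{code}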



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4894:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Fix ClassCastException when using fixed type defining decimal column
> 
>
> Key: HUDI-4894
> URL: https://issues.apache.org/jira/browse/HUDI-4894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> schema for decimal column :
> {code:java}
> {
>     "name": "column_name",
>     "type": ["null", {
>         "type": "fixed",
>         "name": "fixed",
>         "size": 5,
>         "logicalType": "decimal",
>         "precision": 10,
>         "scale": 2
>     }],
>     "default": null
> }{code}
>  
> exception:
> Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
> at 
> org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
> at 
> org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
> at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4915) Spark Avro SerDe returns wrong result upon multiple calls

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4915:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Spark Avro SerDe returns wrong result upon multiple calls
> -
>
> Key: HUDI-4915
> URL: https://issues.apache.org/jira/browse/HUDI-4915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, the Spark Avro serializer/deserializer has a bug: it returns the 
> same object when the method is called twice in a row.  For example:
> val row1: InternalRow = ...
> val row2: InternalRow = ... // row2 is different from row1
>  
> val serializedRecord1 = serialize(row1)
> val serializedRecord2 = serialize(row2)
> serializedRecord1.equals(serializedRecord2)
>  
> That is because we use a `val` to declare the serializer/deserializer 
> methods, so the result of a later call overwrites the earlier one.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4943) Benchmark JOL ClassLayout based object size estimator

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4943.
---

> Benchmark JOL ClassLayout based object size estimator
> -
>
> Key: HUDI-4943
> URL: https://issues.apache.org/jira/browse/HUDI-4943
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.1
>
>
> [https://github.com/apache/hudi/pull/6657#pullrequestreview-1122590536]
> We recently changed the object size estimation method. We should 
> micro-benchmark both methods to ensure there is no perf regression.
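
A hedged sketch of the JMH scaffolding such a micro-benchmark could start from (the payload type is a placeholder and this only covers the JOL side; the real comparison would also benchmark the previous estimator):

{code:java}
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

@State(Scope.Benchmark)
public class ObjectSizeEstimatorBenchmark {

  // Placeholder payload; a real benchmark would measure HoodieRecord-like objects.
  static final class Payload {
    long id = 42L;
    String key = "some-record-key";
  }

  private final Payload sample = new Payload();

  @Benchmark
  public long jolShallowSize() {
    // Shallow instance size via JOL's ClassLayout.
    return ClassLayout.parseInstance(sample).instanceSize();
  }

  @Benchmark
  public long jolDeepSize() {
    // Deep (reachable-graph) size, noticeably more expensive.
    return GraphLayout.parseInstance(sample).totalSize();
  }
}
{code}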



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4922) Presto query of bootstrapped data returns null

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4922:
--
Fix Version/s: 0.13.0
   (was: 0.12.2)

>  Presto query of bootstrapped data returns null
> ---
>
> Key: HUDI-4922
> URL: https://issues.apache.org/jira/browse/HUDI-4922
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>
> https://github.com/apache/hudi/issues/6532



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4854:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Deltastreamer does not respect partition selector regex for metadata-only 
> bootstrap
> ---
>
> Key: HUDI-4854
> URL: https://issues.apache.org/jira/browse/HUDI-4854
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4864) Fix AWSDmsAvroPayload during delete operations with MOR snapshot query

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4864.
---

> Fix AWSDmsAvroPayload during delete operations with MOR snapshot query
> --
>
> Key: HUDI-4864
> URL: https://issues.apache.org/jira/browse/HUDI-4864
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Rahil Chertara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4893:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-01 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1264344080

   
   ## CI report:
   
   * b7a71b7de263c5117e199df45cba0ead191159a3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11938)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11952)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-4913) HoodieSnapshotExporter throws IllegalArgumentException: Wrong FS

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4913.
---

> HoodieSnapshotExporter throws IllegalArgumentException: Wrong FS
> 
>
> Key: HUDI-4913
> URL: https://issues.apache.org/jira/browse/HUDI-4913
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When using the HoodieSnapshotExporter to export a Hudi dataset on S3 to a 
> different bucket, i.e., the source-base-path and the target-output-path are 
> in different buckets, IllegalArgumentException is thrown:
>  
> {code:java}
> ./bin/spark-submit \
>   --master yarn \
>   --deploy-mode client \
>   --driver-memory 10g \
>   --executor-memory 10g \
>   --num-executors 1 \
>   --executor-cores 4 \
>   --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf spark.kryoserializer.buffer=256m \
>   --conf spark.kryoserializer.buffer.max=1024m \
>   --conf spark.rdd.compress=true \
>   --conf spark.memory.storageFraction=0.8 \
>   --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
>   --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
>   --conf spark.ui.proxyBase="" \
>   --conf 'spark.eventLog.enabled=true' --conf 
> 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
>   --conf spark.hadoop.yarn.timeline-service.enabled=false \
>   --conf spark.driver.userClassPathFirst=true \
>   --conf spark.executor.userClassPathFirst=true \
>   --conf "spark.sql.hive.convertMetastoreParquet=false" \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
>       /home/hadoop/hudi-utilities-bundle_2.12-0.13.0-SNAPSHOT.jar \
>   --source-base-path "s3a://ethan-lakehouse-us-east-2/hudi/hudi_trips_cow/" \
>   --target-output-path "s3a://ethan-tmp/backup/" \
>   --output-format "hudi"{code}
>   
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS 
> s3a://ethan-tmp//backup -expected s3a://ethan-lakehouse-us-east-2
>     at 
> org.apache.hadoop.fs.s3native.S3xLoginHelper.checkPath(S3xLoginHelper.java:224)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.checkPath(S3AFileSystem.java:1155)
>     at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:666)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.makeQualified(S3AFileSystem.java:1117)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.qualify(S3AFileSystem.java:1143)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3078)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4263)
>     at 
> org.apache.hudi.utilities.HoodieSnapshotExporter.outputPathExists(HoodieSnapshotExporter.java:145)
>     at 
> org.apache.hudi.utilities.HoodieSnapshotExporter.export(HoodieSnapshotExporter.java:120)
>     at 
> org.apache.hudi.utilities.HoodieSnapshotExporter.main(HoodieSnapshotExporter.java:275)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4921:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we currently just take the last 
> completed commit in the active timeline and set that value. 
> This needs fixing. 
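> A minimal sketch of the intended selection logic (plain Java over instant timestamp 
> lists, not the actual Hudi timeline API): the earliest inflight instant acts as a 
> barrier, and we pick the latest completed instant strictly before it.
> {code:java}
> import java.util.Collections;
> import java.util.List;
> import java.util.Optional;
> 
> public class LastCompletedCommitSketch {
>   // Latest completed instant such that no inflight instant has an earlier timestamp.
>   static Optional<String> lastCompletedBeforeEarliestInflight(
>       List<String> completed, List<String> inflight) {
>     String barrier = inflight.isEmpty() ? null : Collections.min(inflight);
>     return completed.stream()
>         .filter(ts -> barrier == null || ts.compareTo(barrier) < 0)
>         .max(String::compareTo);
>   }
> 
>   public static void main(String[] args) {
>     // completed: c1, c2, c4; inflight: c3 -> the answer must be c2, not c4
>     System.out.println(lastCompletedBeforeEarliestInflight(
>         List.of("c1", "c2", "c4"), List.of("c3")));
>   }
> }
> {code}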



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4911) Make sure LogRecordReader doesn't flush the cache before each lookup

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4911:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Make sure LogRecordReader doesn't flush the cache before each lookup
> 
>
> Key: HUDI-4911
> URL: https://issues.apache.org/jira/browse/HUDI-4911
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently {{HoodieMetadataMergedLogRecordReader }}will flush internal record 
> cache before each lookup which makes every lookup essentially do 
> re-processing of the whole log-blocks stack again.
> We should avoid that and only do the re-parsing incrementally (for the keys 
> that ain't already cached)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4873) Report number of messages from AvroKafkaSource to be processed via metrics

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4873.
---

> Report number of messages from AvroKafkaSource to be processed via metrics
> --
>
> Key: HUDI-4873
> URL: https://issues.apache.org/jira/browse/HUDI-4873
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Volodymyr Burenin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4870) Improve compaction config description

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4870.
---

> Improve compaction config description
> -
>
> Key: HUDI-4870
> URL: https://issues.apache.org/jira/browse/HUDI-4870
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4834) Update AWSGlueCatalog syncing page to add spark datasource example

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4834:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Update AWSGlueCatalog syncing page to add spark datasource example
> --
>
> Key: HUDI-4834
> URL: https://issues.apache.org/jira/browse/HUDI-4834
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Raymond Xu
>Priority: Minor
>  Labels: documentation
> Fix For: 0.13.0
>
>
> [https://hudi.apache.org/docs/next/syncing_aws_glue_data_catalog] this page 
> specifically talks about how to leverage this syncing mechanism via 
> Deltastreamer. We also need an example for the Spark datasource here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4833) Add Postgres Schema Name to Postgres Debezium Source

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4833.
---

> Add Postgres Schema Name to Postgres Debezium Source
> 
>
> Key: HUDI-4833
> URL: https://issues.apache.org/jira/browse/HUDI-4833
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4837) Stop sleeping where it is not necessary in Kafka source

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4837.
---

> Stop sleeping where it is not necessary in Kafka source
> ---
>
> Key: HUDI-4837
> URL: https://issues.apache.org/jira/browse/HUDI-4837
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4816:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Update asf-site docs for GlobalDeleteKeyGenerator
> -
>
> Key: HUDI-4816
> URL: https://issues.apache.org/jira/browse/HUDI-4816
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.13.0
>
>
> The GlobalDeleteKeyGenerator should be used with a global index to delete 
> records based solely on the record key, and it works only for a batch that 
> contains nothing but deletes.  The key generator can be used for both 
> partitioned and non-partitioned tables.  Note that when using 
> GlobalDeleteKeyGenerator, the config 
> hoodie.[bloom|simple|hbase].index.update.partition.path should be set to 
> false to avoid redundant data being written to storage.  
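> As an illustration, a Spark datasource delete in Java might look like the following 
> (a hedged sketch; the table path, dataset and field names are placeholders, and the 
> option keys are the standard Hudi write configs):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> 
> public class GlobalDeleteExample {
>   // deletesDf only needs to carry the record keys of the rows to delete.
>   static void deleteByKey(Dataset<Row> deletesDf) {
>     deletesDf.write().format("hudi")
>         .option("hoodie.table.name", "my_table")
>         .option("hoodie.datasource.write.operation", "delete")
>         .option("hoodie.datasource.write.recordkey.field", "uuid")
>         .option("hoodie.datasource.write.keygenerator.class",
>             "org.apache.hudi.keygen.GlobalDeleteKeyGenerator")
>         .option("hoodie.index.type", "GLOBAL_BLOOM")
>         // avoid redundant data written under a new partition path on delete
>         .option("hoodie.bloom.index.update.partition.path", "false")
>         .mode(SaveMode.Append)
>         .save("s3://bucket/path/my_table");
>   }
> }
> {code}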



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4787) ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable streaming test

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4787:
--
Fix Version/s: 0.13.0

> ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable streaming 
> test
> --
>
> Key: HUDI-4787
> URL: https://issues.apache.org/jira/browse/HUDI-4787
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>
> After [https://github.com/apache/hudi/pull/6098], a change in the streaming 
> sink, testRunHoodieJavaAppOnMultiPartitionKeysMORTable fails intermittently. 
> Upon debugging, I found that the delete-operation thread in 
> HoodieJavaStreamingApp#run has not yet completed when the validation starts. 
> The flakiness could be due to the extra time taken to read checkpoint metadata 
> in each microbatch (due to #6098). A better way would be to rewrite this 
> test. I am not sure why the delete is even required.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4787) ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable streaming test

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4787:
--
Fix Version/s: (was: 0.12.1)

> ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable streaming 
> test
> --
>
> Key: HUDI-4787
> URL: https://issues.apache.org/jira/browse/HUDI-4787
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: hudi-on-call
>
> After [https://github.com/apache/hudi/pull/6098], a change in the streaming 
> sink, testRunHoodieJavaAppOnMultiPartitionKeysMORTable fails intermittently. 
> Upon debugging, I found that the delete-operation thread in 
> HoodieJavaStreamingApp#run has not yet completed when the validation starts. 
> The flakiness could be due to the extra time taken to read checkpoint metadata 
> in each microbatch (due to #6098). A better way would be to rewrite this 
> test. I am not sure why the delete is even required.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4777) Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4777:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Flink gen bucket index of mor table not consistent with spark lead to 
> duplicate bucket issue
> 
>
> Key: HUDI-4777
> URL: https://issues.apache.org/jira/browse/HUDI-4777
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4755) INSERT_OVERWRITE(/TABLE) in spark sql should not fail time travel queries for older timestamps

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4755:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> INSERT_OVERWRITE(/TABLE) in spark sql should not fail time travel queries for 
> older timestamps
> --
>
> Key: HUDI-4755
> URL: https://issues.apache.org/jira/browse/HUDI-4755
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: XiaoyuGeng
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>
> When INSERT_OVERWRITE or INSERT_OVERWRITE_TABLE is used in Spark SQL, we 
> should still support time travel queries for older timestamps. 
>  
> Ref issue: https://github.com/apache/hudi/issues/6452



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4741) Deadlock when restarting failed TM in AbstractStreamWriteFunction

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4741:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Deadlock when restarting failed TM in AbstractStreamWriteFunction
> -
>
> Key: HUDI-4741
> URL: https://issues.apache.org/jira/browse/HUDI-4741
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: coordinator_operator_review-old operator.png
>
>
> h1. Summary of Events
>  # TM heartbeat not sent to JM (Can be triggered by killing a container), JM 
> kills the TM/container
>  # JM restarts the container, but the restarting code is not handled 
> properly, causing there to be a deadlock
>  # Deadlock causes instantToWrite() to loop for 10 minutes (default Flink 
> checkpoint timeout), causing an instant initialization timeout error
>  # JM is restarted
>  # JM restore state from previously successful checkpoint
>  # Issue in HUDI-4907 occurs
>  
> h1. Code for reproducing
> h2. Flink SQL Code
> {code:java}
> CREATE TABLE input_table (
>     `val`               STRING
>     ,`event_time`       TIMESTAMP(3)
>     ,`partition`        BIGINT
>     ,`offset`           BIGINT
> ) WITH (
>     'connector' = 'datagen',
>     'fields.val.length' = '9',
>     'rows-per-second' = '15000'
> );
> CREATE TABLE test_hudi
> (
>     `val`                 STRING
>     ,`event_time`       TIMESTAMP(3)
>     ,`partition`        BIGINT
>     ,`offset`           BIGINT
>     ,`dt`               STRING
>     ,`hh`               STRING
> ) PARTITIONED BY (dt, hh)
> WITH (
>     'connector' = 'hudi',
>     'path' = 'hdfs://jm_tm_sync_error/',
>     'table.type' = 'COPY_ON_WRITE',
>     'write.operation' = 'insert',
>     'hoodie.parquet.small.file.limit' = '104857600',
>     'hoodie.parquet.max.file.size' = '268435456',
>     'hoodie.datasource.write.recordkey.field' = 'partition,offset',
>     'hoodie.datasource.write.hive_style_partitioning' = 'true',
>     'hoodie.datasource.write.partitionpath.field' = 'dt,hh',
>     'write.bulk_insert.sort_input' = 'false',
>     'index.bootstrap.enabled' = 'false',
>     'index.state.ttl' = '60',
>     'index.type' = 'FLINK_STATE',
>     'hoodie.datasource.write.keygenerator.class' = 
> 'org.apache.hudi.keygen.ComplexAvroKeyGenerator',
>     'write.tasks' = '8',
>     'hive_sync.enable' = 'false'
> );
> insert into test_hudi
> select  `val`
>         ,`event_time`
>         ,`partition`
>         ,`offset`
>         ,DATE_FORMAT(event_time, '-MM-dd')
>         ,DATE_FORMAT(event_time, 'HH')
>  from input_table; {code}
>  
> h2. Advanced Properties
> {code:java}
> execution.checkpointing.interval=6ms {code}
>  
> h2. Job Profile Properties
> {code:java}
> flink.version=1.13.14
> default.parallelism=8
> restart.from.savepoint=true
> sql.job.mode=normal
> running.mode=streaming
> slots.per.tm=2
> cpu.per.tm=2vcore
> memory.per.tm=6G
> jvm.heap.ratio=70% {code}
>  
>  
> h1. Issues: TM failing + starting a TM in a new container causing deadlock 
>  # When a TM fails and a replacement TM is started and restored in a new container, a 
> deadlock situation is created
>  ** the TM is waiting for the JM to create a new _INFLIGHT_ instant, and
>  ** the JM is waiting for the TM to send a success WriteMetadataEvent
>  # The deadlock above will cause either of the errors below:
>  ** org.apache.hudi.exception.HoodieException: Timeout(601000ms) while 
> waiting for instant initialize
>  ** org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> expired before completing.
>  # This will trigger org.apache.flink.runtime.jobmaster.JobMaster [] - Trying 
> to recover from a global failure.
>  # JM will try to restore itself from the last successful checkpoint
>  # This will cause HUDI-4907
> h2. Root cause
> When restoring the TM, *`AbstractStreamWriteFunction#initializeState()`* will 
> attempt to restore the state of the TM. At this stage, 
> *`this.currentInstant`* will be initialized by invoking 
> {*}`lastPendingInstant()`{*}, in which the ckp metadata path will be loaded 
> and an _INFLIGHT_ instant is returned.
>  
> When invoking {*}`instantToWrite()`{*}, 
> *`instant.equals(this.currentInstant)`* will always be true as the local 
> *`instant`* is equal to {*}`this.currentInstant`{*}. Hence, the current 
> implementation will be stuck in an infinite loop as 
> {*}`lastPendingInstant()`{*}, which governs both *`instant`* and 
> *`this.currentInstant`* will always return the same value as the state of the 
> ckp metadata path is never changed. 
>  
> This is because the JM is waiting for the TM to finish writing the batch for 
> the _INFLIGHT_ instant. 
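> A condensed sketch of the wait loop described above (names simplified from 
> AbstractStreamWriteFunction, illustrative only): both sides of the comparison come 
> from the same unchanged ckp metadata, so the loop can never exit once the restored 
> state already equals the pending instant.
> {code:java}
> public class InstantToWriteSketch {
>   interface CkpMetadata {
>     String lastPendingInstant();
>   }
> 
>   private String currentInstant; // restored in initializeState() from ckp metadata
> 
>   String instantToWrite(CkpMetadata ckp) throws InterruptedException {
>     String instant = ckp.lastPendingInstant();
>     // Deadlock: the TM waits for a *new* inflight instant, but the JM is itself
>     // waiting for this TM's WriteMetadataEvent before it creates one.
>     while (instant == null || instant.equals(this.currentInstant)) {
>       Thread.sleep(100L);
>       instant = ckp.lastPendingInstant(); // never changes -> loops until the timeout
>     }
>     return instant;
>   }
> }
> {code}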

[jira] [Updated] (HUDI-4733) Flag emitDelete is inconsistent in HoodieTableSource and MergeOnReadInputFormat

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4733:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Flag emitDelete is inconsistent in HoodieTableSource and 
> MergeOnReadInputFormat
> ---
>
> Key: HUDI-4733
> URL: https://issues.apache.org/jira/browse/HUDI-4733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, flink-sql
>Reporter: nonggia.liang
>Assignee: Zhaojing Yu
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: image 1.png
>
>
> When reading a MOR table in Flink, we encountered an exception from the Flink 
> runtime (as shown in image 1), which complained that the table source should not 
> emit a retract record.
> !image 1.png!
> I think this is the cause, in HoodieTableSource:
> {code:java}
> @Override
> public ChangelogMode getChangelogMode() {
>   // when read as streaming and changelog mode is enabled, emit as FULL mode;
>   // when all the changes are compacted or read as batch, emit as INSERT mode.
>   return OptionsResolver.emitChangelog(conf) ? ChangelogModes.FULL : 
> ChangelogMode.insertOnly();
> } {code}
> {code:java}
> private InputFormat getStreamInputFormat() { 
> ...
> if (FlinkOptions.QUERY_TYPE_SNAPSHOT.equals(queryType)) { 
>   final HoodieTableType tableType = 
> HoodieTableType.valueOf(this.conf.getString(FlinkOptions.TABLE_TYPE)); 
>   boolean emitDelete = tableType == HoodieTableType.MERGE_ON_READ; 
>   return mergeOnReadInputFormat(rowType, requiredRowType, tableAvroSchema, 
> rowDataType, Collections.emptyList(), emitDelete); }
> ...
>  }
> {code}
> With these options:
> {{'table.type'}} {{= }}{{'MERGE_ON_READ'}}
> {{'read.streaming.enabled'}} {{= }}{{'true'}}
> The {{HoodieTableSource}} announces it has only an INSERT changelog, 
> but MergeOnReadInputFormat will emit deletes.
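> One way to make the two sides agree (an illustrative sketch, not a reviewed fix) is 
> to drive emitDelete from the same predicate that getChangelogMode() uses, so the 
> source never emits retractions it did not announce:
> {code:java}
> // inside getStreamInputFormat(), reusing the option already checked in getChangelogMode()
> boolean emitDelete = tableType == HoodieTableType.MERGE_ON_READ
>     && OptionsResolver.emitChangelog(conf);
> {code}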



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4735) Spark2 bundles made from master after 2022-07-23 failed to stop

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4735:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Spark2 bundles made from master after 2022-07-23 failed to stop
> ---
>
> Key: HUDI-4735
> URL: https://issues.apache.org/jira/browse/HUDI-4735
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, spark, spark-sql
>Reporter: Vitali Obukh
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.13.0
>
>
> Daily Spark2 Structured streaming application built against the latest change 
> on master started failing on 2022-07-23 with the error:
> {quote}
> 22/07/28 10:52:50 ERROR org.apache.hudi.HoodieStreamingSink: Micro batch id=2 
> threw following exception: 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
> 'test' not found;
> {quote}
> Actually the database exists (created by Hudi). A Spark3 application with 
> the same code works fine.
> Hudi options used for writing:
> {code}
>  Map("hoodie.datasource.write.operation" -> "upsert",
>   "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
>   "hoodie.datasource.write.recordkey.field" -> s"$Meta.$Key",
>   "hoodie.datasource.write.precombine.field" -> s"$Meta.$Version",
>   "hoodie.datasource.write.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.compaction.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.datasource.write.partitionpath.field" -> TopLevelPartition,
>   "hoodie.datasource.write.hive_style_partitioning" -> "true",
>   "hoodie.datasource.hive_sync.enable" -> "true",
>   "hoodie.datasource.hive_sync.mode" -> "hms",
>   "hoodie.datasource.hive_sync.database" -> dbName(),
>   "hoodie.datasource.hive_sync.partition_fields" -> TopLevelPartition,
>   "hoodie.datasource.compaction.async.enable" -> "false"
> )
> {code}
> Cloud platform - Google Dataproc. 
> Spark version - 3.1.3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4792) Speed up cleaning with metadata table enabled

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4792:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Speed up cleaning with metadata table enabled 
> --
>
> Key: HUDI-4792
> URL: https://issues.apache.org/jira/browse/HUDI-4792
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, fetching the file groups to be deleted is parallelized over each 
> partition. As a result, in the case of many partitions, many calls are made on the 
> metadata. While this is ok for the file system view, it is highly inefficient 
> with the metadata table view (MDT). Each call likely triggers the MoR merge 
> on the MDT, and in the case of thousands of partitions the process is 
> incredibly slow. 
> I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on 
> a 40k-partition hudi table:
>  * w/ MDT: 5 hours
>  * w/o MDT: 5 minutes
> This slowness makes the use of the MDT unreasonable in the case of many 
> partitions, because cleaning is a must-have.
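> A rough sketch of the batching idea (hypothetical interface names, for illustration 
> only): issue a single metadata-table lookup for all partitions instead of one 
> MoR-backed lookup per partition.
> {code:java}
> import java.util.List;
> import java.util.Map;
> 
> // Hypothetical view abstraction; the point is one call, one merge of the MDT logs.
> interface MetadataView {
>   Map<String, List<String>> getAllFilesInPartitions(List<String> partitions);
> }
> 
> class CleanPlanSketch {
>   // instead of partitions.parallelStream().map(p -> view.getFilesInPartition(p)) ...
>   static Map<String, List<String>> listOnce(MetadataView view, List<String> partitions) {
>     return view.getAllFilesInPartitions(partitions);
>   }
> }
> {code}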



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4764) AwsglueSync turn already exist error into warning

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4764:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> AwsglueSync turn already exist error into warning
> -
>
> Key: HUDI-4764
> URL: https://issues.apache.org/jira/browse/HUDI-4764
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Under some conditions (OCC?) the AWSGlueCatalogSyncClient fails with an 
> already-exists exception for a partition. In any case, if a given partition already 
> exists this should not fail the sync, but raise a warning.  
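> A minimal sketch of the proposed handling (assuming the Glue SDK's 
> AlreadyExistsException; class and logger names are illustrative):
> {code:java}
> import com.amazonaws.services.glue.model.AlreadyExistsException;
> import org.apache.log4j.Logger;
> 
> public class AddPartitionsSketch {
>   private static final Logger LOG = Logger.getLogger(AddPartitionsSketch.class);
> 
>   void addPartition(Runnable glueCreatePartitionCall, String partition) {
>     try {
>       glueCreatePartitionCall.run();
>     } catch (AlreadyExistsException e) {
>       // The partition is already registered (e.g. a concurrent writer); not a sync failure.
>       LOG.warn("Partition " + partition + " already exists in Glue, skipping", e);
>     }
>   }
> }
> {code}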



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-01 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1264346752

   
   ## CI report:
   
   * e3aef767db19eed24222f8fff89ae4c59d0799c2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6848: [HUDI-4769] Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread GitBox


hudi-bot commented on PR #6848:
URL: https://github.com/apache/hudi/pull/6848#issuecomment-1264346736

   
   ## CI report:
   
   * 6318c29ab4e0415d365db365aac92f1f562a4b68 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11954)
 
   * fcfc1baee8aa5394c4472135f6ceb0d280cf8c16 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4704:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.13.0
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> will delete the table and then recreate it, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4688) Decouple lazy cleaning of failed writes from clean action in multi-writer

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4688:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Decouple lazy cleaning of failed writes from clean action in multi-writer
> -
>
> Key: HUDI-4688
> URL: https://issues.apache.org/jira/browse/HUDI-4688
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4631:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Enhance retries for failed writes w/ write conflicts in a multi writer 
> scenarios
> 
>
> Key: HUDI-4631
> URL: https://issues.apache.org/jira/browse/HUDI-4631
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Let's say there are two writers from t0 to t5, and Hudi fails w2 and succeeds 
> w1. The user restarts w2, and for the next 5 mins let's say there are no other 
> overlapping writers, so the same write from w2 will now succeed. So whenever 
> there is a write conflict and the pipeline fails, all the user needs to do is 
> restart the pipeline or retry ingesting the same batch.
>  
> Ask: can we add retries within Hudi during such failures? In most 
> cases, users just restart the pipeline anyway. 
>  
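> A hedged sketch of what such an internal retry could look like (generic Java; the 
> exception and batch types are placeholders, not the actual Hudi API):
> {code:java}
> public class ConflictRetrySketch {
>   interface WriteBatch { void commit() throws WriteConflictException; }
>   static class WriteConflictException extends RuntimeException { }
> 
>   static void writeWithRetries(WriteBatch batch, int maxRetries) throws InterruptedException {
>     for (int attempt = 1; ; attempt++) {
>       try {
>         batch.commit();
>         return;
>       } catch (WriteConflictException e) {
>         if (attempt > maxRetries) {
>           throw e; // give up and surface the conflict to the user
>         }
>         // back off and retry the same batch, mirroring what users do manually today
>         Thread.sleep(1000L * attempt);
>       }
>     }
>   }
> }
> {code}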



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4674) change the default value of inputFormat for the MOR table

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4674:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> change the default value of inputFormat for the MOR table
> -
>
> Key: HUDI-4674
> URL: https://issues.apache.org/jira/browse/HUDI-4674
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: linfey.nie
>Assignee: linfey.nie
>Priority: Major
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.13.0
>
>
> When we build a MOR table, for example with Spark SQL, the default value of 
> inputFormat is HoodieParquetRealtimeInputFormat. But when we use Hive sync for the 
> metadata and skip the _ro suffix for reads, the inputFormat of the original 
> table name should be HoodieParquetInputFormat, which it currently is not. I think we 
> should change the default value of inputFormat, just like for the COW table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4586:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For a partitioned table, a significant number of S3 requests time out, 
> causing the upserts to fail when using the Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137+details
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>    

[jira] [Updated] (HUDI-4602) Implement getScheme for PrestoS3FileSystem

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4602:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Implement getScheme for PrestoS3FileSystem
> --
>
> Key: HUDI-4602
> URL: https://issues.apache.org/jira/browse/HUDI-4602
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.0
>
>
> In the absence of this method, we get the below exception, unless we are running 
> on EMR (where EmrFileSystem gets picked as the S3 file system type and already 
> implements this method).
> {code:java}
> java.lang.UnsupportedOperationException: Not implemented by the 
> PrestoS3FileSystem FileSystem implementation
>   at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:219)
>   at 
> org.apache.hadoop.fs.HadoopExtendedFileSystem.getScheme(HadoopExtendedFileSystem.java:71)
>   at org.apache.hudi.common.fs.FSUtils.isGCSFileSystem(FSUtils.java:592)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.getFSDataInputStream(HoodieLogFileReader.java:119)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.(HoodieLogFileReader.java:95)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.(HoodieLogFileReader.java:86)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFormat.newReader(HoodieLogFormat.java:282)
>   at 
> org.apache.hudi.common.table.log.LogReaderUtils.readSchemaFromLogFileInReverse(LogReaderUtils.java:49)
>   at 
> org.apache.hudi.common.table.log.LogReaderUtils.readLatestSchemaFromLogFiles(LogReaderUtils.java:77)
>   at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:85)
>   at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
>   at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:62)
>   at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
>   at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
>   at 
> org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:323)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:272)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>   at 
> com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:81)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:73)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.getPageSourceFromCursorProvider(HivePageSourceProvider.java:573)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:414)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:186)
>   at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:63)
>   at 
> com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:80)
>   at 
> com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:235)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:424)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:307)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:728)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:300)
>   at 
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
>   at 
> com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
>   at 
> com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:599)
>   at 
> com.facebook.presto.$gen.Presto_0_270_AHN_0_620220811_002110_1.run(Unknown
>  Source)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829) {code}
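> The missing override itself is a one-liner along these lines (sketch only; the actual 
> class lives in the Presto codebase and extends org.apache.hadoop.fs.FileSystem):
> {code:java}
> // Without this override, FileSystem.getScheme() throws
> // UnsupportedOperationException("Not implemented by the ... FileSystem implementation").
> @Override
> public String getScheme() {
>   return "s3";
> }
> {code}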



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4624) Make sure all DeltaStreamer Sources are Closeable

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4624:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Make sure all DeltaStreamer Sources are Closeable
> -
>
> Key: HUDI-4624
> URL: https://issues.apache.org/jira/browse/HUDI-4624
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Alexey Kudinkin
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, many of the sources acquire resources (creating 
> Kafka/Pulsar/SQS/etc. clients) but do not release them properly, since they 
> don't implement the `Closeable` interface properly.
> We need to rectify that by making sure that any DS Source that acquires a 
> resource releases it properly in `close`.
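> A minimal sketch of the expected shape (generic Java, not the actual DeltaStreamer 
> Source signatures): whoever creates the client is also responsible for closing it.
> {code:java}
> import java.io.Closeable;
> import java.util.Properties;
> 
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> 
> public class KafkaBackedSourceSketch implements Closeable {
>   private final KafkaConsumer<String, byte[]> consumer;
> 
>   public KafkaBackedSourceSketch(Properties props) {
>     this.consumer = new KafkaConsumer<>(props);
>   }
> 
>   @Override
>   public void close() {
>     consumer.close(); // release the sockets and threads held by the client
>   }
> }
> {code}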



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4573) Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4573:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode
> 
>
> Key: HUDI-4573
> URL: https://issues.apache.org/jira/browse/HUDI-4573
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4341) HoodieHFileReader is not compatible with Hadoop 3

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4341:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> HoodieHFileReader is not compatible with Hadoop 3
> -
>
> Key: HUDI-4341
> URL: https://issues.apache.org/jira/browse/HUDI-4341
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: spark
> Fix For: 0.13.0
>
>
> [https://github.com/apache/hudi/issues/5765]
> Spark SQL throws "java.lang.NoSuchMethodError: 
> org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" after 
> a while.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4369) Hudi Kafka Connect Sink writing to GCS bucket

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4369:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Hudi Kafka Connect Sink writing to GCS bucket
> -
>
> Key: HUDI-4369
> URL: https://issues.apache.org/jira/browse/HUDI-4369
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: kafka-connect
>Reporter: Vishal Agarwal
>Priority: Critical
> Fix For: 0.13.0
>
>
> Hi team,
> I am trying to use the Hudi sink connector with Kafka Connect to write to a GCS 
> bucket, but I am getting an error regarding the "gs" file scheme. I have added all 
> GCS-related properties in core-site.xml and the corresponding gcs-connector 
> jar in the plugin path, but I am still facing the issue.
> The issue was already reported with S3 as per jira 
> https://issues.apache.org/jira/browse/HUDI-3610, but I am unable to get the 
> resolution.
> Happy to discuss this!
> Thanks
> *StackTrace-*
> %d [%thread] %-5level %logger - %msg%n 
> org.apache.hudi.exception.HoodieException: Fatal error instantiating Hudi 
> Write Provider 
>  at 
> org.apache.hudi.connect.writers.KafkaConnectWriterProvider.(KafkaConnectWriterProvider.java:103)
>  ~[connectors-uber.jar:?]
>  at 
> org.apache.hudi.connect.transaction.ConnectTransactionParticipant.(ConnectTransactionParticipant.java:65)
>  ~[connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:198) 
> [connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) 
> [connectors-uber.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:587)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:67)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:652)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
>  [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231) 
> [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) 
> [kafka-clients-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:444)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:317) 
> [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
>  [connect-runtime-2.4.1.jar:?]
>  at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
>  [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) 
> [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) 
> [connect-runtime-2.4.1.jar:?]
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_331]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_331]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_331]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_331]
>  at java.lang.Thread.run(Thread.java:750) [?:1.8.0_331]
> Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get 
> instance of org.apache.hadoop.fs.FileSystem
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:109) 
> ~[connectors-uber.jar:?]
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:100) 
> ~[connectors-uber.jar:?]
>  at org.apache.hudi.client.BaseHoodieClient.(BaseHoodieClient.java:69) 
> ~[connectors-uber.jar:?]
>  at 
> org.apache.hudi.client.BaseHoodieWriteClient.(BaseHoodieWriteClient.java:175)
>  ~[connectors-uber.jar:?]

[jira] [Updated] (HUDI-4330) NPE when trying to upsert into a dataset with no Meta Fields

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4330:
--
Fix Version/s: (was: 0.12.1)

> NPE when trying to upsert into a dataset with no Meta Fields
> 
>
> Key: HUDI-4330
> URL: https://issues.apache.org/jira/browse/HUDI-4330
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>
> When trying to upsert into a dataset with Meta Fields disabled, you 
> will encounter an obscure NPE like the one below:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 25 in stage 20.0 failed 4 times, most recent failure: Lost task 25.3 in 
> stage 20.0 (TID 4110) (ip-172-31-20-53.us-west-2.compute.internal executor 
> 7): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:131)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:88)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}
> Instead, we could be more explicit as to why this could have happened 
> (meta-fields disabled -> no bloom filter created -> unable to do upserts)
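> A sketch of the more explicit failure suggested above (hypothetical placement and 
> parameter names, illustrative only):
> {code:java}
> public class UpsertPreconditionSketch {
>   static void validate(boolean populateMetaFields, String indexType) {
>     if (!populateMetaFields && indexType.startsWith("BLOOM")) {
>       throw new IllegalStateException(
>           "Upserts with a Bloom index require meta fields: meta fields are disabled, "
>               + "so no bloom filters or record keys were written to the data files.");
>     }
>   }
> }
> {code}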



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4454) Support hiveSync command based on Call Procedure Command

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4454:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Support hiveSync command based on Call Procedure Command
> 
>
> Key: HUDI-4454
> URL: https://issues.apache.org/jira/browse/HUDI-4454
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6848: [HUDI-4769] Option read.streaming.skip_compaction skips delta commit

2022-10-01 Thread GitBox


hudi-bot commented on PR #6848:
URL: https://github.com/apache/hudi/pull/6848#issuecomment-1264347911

   
   ## CI report:
   
   * 6318c29ab4e0415d365db365aac92f1f562a4b68 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11954)
 
   * fcfc1baee8aa5394c4472135f6ceb0d280cf8c16 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11955)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-01 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1264347926

   
   ## CI report:
   
   * e3aef767db19eed24222f8fff89ae4c59d0799c2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11956)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4522) [DOCS] Set presto session prop to use parquet column names in case of type mismatch

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4522:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> [DOCS] Set presto session prop to use parquet column names in case of type 
> mismatch
> ---
>
> Key: HUDI-4522
> URL: https://issues.apache.org/jira/browse/HUDI-4522
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Léo Biscassi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> See https://github.com/apache/hudi/issues/6142



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4488) Improve S3 File listing efficiency

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4488.
---

> Improve S3 File listing efficiency
> --
>
> Key: HUDI-4488
> URL: https://issues.apache.org/jira/browse/HUDI-4488
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vamshi Gudavarthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Improve S3 file listing by moving the check from the driver to the executor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4542) Flink streaming query fails with ClassNotFoundException

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4542:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Flink streaming query fails with ClassNotFoundException
> ---
>
> Key: HUDI-4542
> URL: https://issues.apache.org/jira/browse/HUDI-4542
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-08-04 at 17.17.42.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy it to /lib/flink/lib on the EMR master node
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Write a Hudi table with a few commits with metadata table enabled (no column 
> stats).  Then, run the following for the streaming query
> {code:java}
> CREATE TABLE t2(
>    uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>    name VARCHAR(10),
>    age INT,
>    ts TIMESTAMP(3),
>    `partition` VARCHAR(20)
>  )
>  PARTITIONED BY (`partition`)
>  WITH (
>    'connector' = 'hudi',
>    'path' = 's3a://',
>    'table.type' = 'MERGE_ON_READ',
>    'read.streaming.enabled' = 'true',  -- this option enable the streaming 
> read
>    'read.start-commit' = '20220803165232362', -- specifies the start commit 
> instant time
>    'read.streaming.check-interval' = '4' -- specifies the check interval for 
> finding new source commits, default 60s.
>  ); {code}
> {code:java}
> select * from t2; {code}
> {code:java}
> Flink SQL> select * from t2;
> 2022-08-05 00:12:43,635 INFO  org.apache.hadoop.metrics2.impl.MetricsConfig   
>              [] - Loaded properties from hadoop-metrics2.properties
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - Scheduled 
> Metric snapshot period at 300 second(s).
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - 
> s3a-file-system metrics system started
> 2022-08-05 00:12:47,722 INFO  org.apache.hadoop.fs.s3a.S3AInputStream         
>              [] - Switching to Random IO seek policy
> 2022-08-05 00:12:47,941 INFO  org.apache.hadoop.yarn.client.RMProxy           
>              [] - Connecting to ResourceManager at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:8032
> 2022-08-05 00:12:47,942 INFO  org.apache.hadoop.yarn.client.AHSProxy          
>              [] - Connecting to Application History server at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:10200
> 2022-08-05 00:12:47,942 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - No path for the flink jar passed. Using the location of 
> class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2022-08-05 00:12:47,942 WARN  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR 
> environment variable is set.The Flink YARN Client needs one of these to be 
> set to properly load the Hadoop configuration for accessing YARN.
> 2022-08-05 00:12:47,959 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Found Web Interface 
> ip-172-31-3-92.us-east-2.compute.internal:39605 of application 
> 'application_1659656614768_0001'.
> [ERROR] Could not execute SQL statement. Reason:
> java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat{code}
> {code:java}
> 2022-08-04 17:12:59
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> sun.ref

[jira] [Updated] (HUDI-4457) Make sure IT docker test return code non-zero when failed

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4457:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Make sure IT docker test return code non-zero when failed
> -
>
> Key: HUDI-4457
> URL: https://issues.apache.org/jira/browse/HUDI-4457
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Critical
> Fix For: 0.13.0
>
>
> There is an IT test case where the docker command runs and returns exit code 0, 
> but the test actually failed. This is misleading for troubleshooting.
> TODO
> 1. Verify the behavior
> 2. Fix it (a minimal sketch of the desired fail-loud behavior follows)
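> A minimal illustration of propagating the container exit code so the test fails 
> loudly. This is a hedged sketch, not the actual IT harness; the command, the 
> container name and the helper are hypothetical.
> {code:java}
> import java.io.IOException;
> 
> public class DockerExitCodeCheck {
>   // Runs a command and fails fast on a non-zero exit code instead of
>   // silently letting the CI runner see success.
>   static int runInContainer(String... cmd) throws IOException, InterruptedException {
>     Process process = new ProcessBuilder(cmd).inheritIO().start();
>     int exitCode = process.waitFor();
>     if (exitCode != 0) {
>       throw new IllegalStateException("Command failed with exit code " + exitCode);
>     }
>     return exitCode;
>   }
> 
>   public static void main(String[] args) throws Exception {
>     // Hypothetical container name; any docker exec invocation works the same way.
>     runInContainer("docker", "exec", "adhoc-1", "/bin/bash", "-c", "exit 1");
>   }
> }
> {code}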



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4302) CI Instability / flaky tests

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4302:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> CI Instability / flaky tests
> 
>
> Key: HUDI-4302
> URL: https://issues.apache.org/jira/browse/HUDI-4302
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> Creating an EPIC to track the flaky tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4244) Support common Spark transformations w/in Spark SQL "partitioned by" clause

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4244:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Support common Spark transformations w/in Spark SQL "partitioned by" clause
> ---
>
> Key: HUDI-4244
> URL: https://issues.apache.org/jira/browse/HUDI-4244
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.13.0
>
>
> Currently if you create a Hudi table from Spark SQL:
> {code:java}
> CREATE TABLE test_create(
> f1 STRING,
> f2 STRING,
> f3 STRING,
> ts timestamp
> )using hudi
>  partitioned by ( hours(ts))
>  options (
>   type = 'mor'
>  )
>  tblproperties (
>   primaryKey = 'f1',
>   preCombineField = 'ts'
> );
>  {code}
> You'll be getting
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> org.apache.spark.sql.AnalysisException: Transforms cannot be converted to 
> partition columns: hours(ts){code}
>  
> Original reported task:
> [https://github.com/apache/hudi/issues/5810]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4237) spark.sql.sources.schema.partCol.0 is non-empty in HiveMetaStore when create non-partition hudi table in Spark

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu resolved HUDI-4237.
---

> spark.sql.sources.schema.partCol.0 is non-empty in HiveMetaStore when create 
> non-partition hudi table in Spark
> --
>
> Key: HUDI-4237
> URL: https://issues.apache.org/jira/browse/HUDI-4237
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: du.junling
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Creating a non-partitioned hudi table in Spark stores 
> spark.sql.sources.schema.partCol.0 with an empty value in the HiveMetastore.
> This is unexpected behavior.
> Steps to reproduce the behavior:
> 1. Create a non-partitioned hudi table in Spark
> {code:java}
> create table hudi_mor_tbl (
> id int,
> name string,
> price double,
> ts bigint
> ) using hudi
> tblproperties (
> type = 'mor',
> primaryKey = 'id',
> preCombineField = 'ts'
> ) {code}
> 2. Insert one row of data into it.
> {code:java}
> insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
> 3. cat hoodie.properties in the table's base path;
> it includes the partition.fields key with an empty value
> {code:java}
> hoodie.table.partition.fields=
>  {code}
> 4. Check spark.sql.sources.schema.partCol.0 stored in the TABLE_PARAMS table 
> of the HiveMetastore.
> {code:java}
> |50|spark.sql.sources.schema.partCol.0|
>  {code}
> It has the value "".
> *Expected behavior*
> There is no hoodie.table.partition.fields in hoodie.properties and no 
> spark.sql.sources.schema.partCol.0 in the HiveMetastore.
> *Environment Description*
>  * Hudi version : 0.10.0
>  * Spark version : 3.2.1
>  * Hive version : 3.1.2
>  * Hadoop version : 3.3.1
>  * Storage (HDFS/S3/GCS..) : HDFS
>  * Running on Docker? (yes/no) : no



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4330) NPE when trying to upsert into a dataset with no Meta Fields

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4330:
--
Fix Version/s: 0.13.0

> NPE when trying to upsert into a dataset with no Meta Fields
> 
>
> Key: HUDI-4330
> URL: https://issues.apache.org/jira/browse/HUDI-4330
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.13.0
>
>
> When trying to upsert into a dataset with meta fields disabled, you 
> will encounter an obscure NPE like the one below:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 25 in stage 20.0 failed 4 times, most recent failure: Lost task 25.3 in 
> stage 20.0 (TID 4110) (ip-172-31-20-53.us-west-2.compute.internal executor 
> 7): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:131)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:88)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}
> Instead, we could be more explicit as to why this could have happened 
> (meta-fields disabled -> no bloom filter created -> unable to do upserts)
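> As a rough illustration (a hedged sketch only; the exact placement of such a 
> guard and the config accessor are assumptions, not the actual fix):
> {code:java}
> import org.apache.hudi.exception.HoodieIndexException;
> 
> public class MetaFieldsGuardSketch {
>   // Fail with a descriptive message up front instead of an NPE deep inside
>   // the bloom index key lookup.
>   static void validateUpsertPreconditions(boolean populateMetaFields) {
>     if (!populateMetaFields) {
>       throw new HoodieIndexException(
>           "Meta fields are disabled, so no bloom filters were written with the base files; "
>               + "upserts relying on the bloom index are not supported for this table.");
>     }
>   }
> }
> {code}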



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4123:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
> 
>
> Key: HUDI-4123
> URL: https://issues.apache.org/jira/browse/HUDI-4123
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When using SqlSource:
> ## Create hive source table
> ```sql
> create database test location '/test';
> create table test.test_source (
>   id int,
>   name string,
>   price double,
>   dt string,
>   ts bigint
> );
> insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
> ```
> ## Use SqlSource
> sql_source.properties
> ```
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.partitionpath.field=dt
> hoodie.deltastreamer.source.sql.sql.query = select * from test.test_source
> hoodie.datasource.hive_sync.table=test_hudi_target
> hoodie.datasource.hive_sync.database=hudi
> hoodie.datasource.hive_sync.partition_fields=dt
> hoodie.datasource.hive_sync.create_managed_table = true
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> ```
> ```bash
> spark-submit --conf "spark.sql.catalogImplementation=hive" \
> --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 
> --executor-cores 2 --driver-memory 4G --driver-cores 2 \
> --principal spark/indata-10-110-105-163.indata@indata.com --keytab 
> /etc/security/keytabs/spark.service.keytab \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  \
> --props file:///opt/sql_source.properties  \
> --target-base-path /hudi/test_hudi_target \
> --target-table test_hudi_target \
> --op BULK_INSERT \
> --table-type COPY_ON_WRITE \
> --source-ordering-field ts \
> --source-class org.apache.hudi.utilities.sources.SqlSource \
> --enable-sync  \
> --checkpoint earliest \
> --allow-commit-on-no-checkpoint-change
> ```
> Once executed, the hive source table can be successfully written to the Hudi 
> target table.
> However, if it is executed multiple times, such as the second time, an 
> exception will be thrown:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :
> "deltastreamer.checkpoint.reset_key" : "earliest"
> ```
> The reason is that the value of `deltastreamer.checkpoint.reset_key` is 
> `earliest`, but `deltastreamer.checkpoint.key` is null.
> According to the logic of the method `getCheckpointToResume`, this combination 
> throws the exception above.
> I think that since the checkpoint returned by the source is null, the value of 
> `deltastreamer.checkpoint.key` should also be saved as null. That also avoids 
> this exception, given the logic of the method `getCheckpointToResume`.
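> A hedged sketch of the resolution being suggested (property names follow the 
> description above; the method name and shape are illustrative, not the actual 
> `getCheckpointToResume` implementation):
> {code:java}
> import java.util.Map;
> 
> import org.apache.hudi.common.util.Option;
> 
> public class CheckpointResumeSketch {
>   // Resolve the checkpoint to resume from the last commit's extra metadata.
>   static Option<String> resolveCheckpoint(Map<String, String> lastCommitExtraMetadata) {
>     String checkpoint = lastCommitExtraMetadata.get("deltastreamer.checkpoint.key");
>     if (checkpoint != null && !checkpoint.isEmpty()) {
>       return Option.of(checkpoint);
>     }
>     // A reset key without a stored checkpoint (e.g. SqlSource returned a null
>     // checkpoint) resumes with an empty checkpoint instead of throwing.
>     return Option.empty();
>   }
> }
> {code}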
>  
>  
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit 
> :Option{val=[20220519162403646__commit__COMPLETED]}, Instants 
> :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
>   "partitionToWriteStats" : {
>     "2016/03/15" : [
> {       "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",       "path" : 
> "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
>        "prevCommit" : "null",       "numWrites" : 342,       "numDeletes" : 
> 0,       "numUpdateWrites" : 0,       "numInserts" : 342,       
> "totalWriteBytes" : 481336,       "totalWriteErrors" : 0,       "tempPath" : 
> null,       "partitionPath" : "2016/03/15",       "totalLogRecords" : 0,      
>  "totalLogFilesCompacted" : 0,       "totalLogSizeCompacted" : 0,       
> "totalUpdatedRecordsCompacted" : 0,       "totalLogBlocks" : 0,       
> "totalCorruptLogBlock" : 0,       "totalRollbackBlocks" : 0,       
> "fileSizeInBytes" : 481336,       "minEventTime" : null,       "maxEventTime" 
> : null     }
> ],
>     "2015/03/16" : [
> {       "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",       "path" : 
> "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
>        "prevCommit" : "null",       "

[jira] [Updated] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4184:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Creating external table in Spark SQL modifies "hoodie.properties"
> -
>
> Key: HUDI-4184
> URL: https://issues.apache.org/jira/browse/HUDI-4184
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.0
>
>
> My setup was as follows:
>  # There's a table existing in one AWS account
>  # I'm trying to access that table from Spark SQL from _another_ AWS account 
> that only has read permissions on the bucket with the table.
>  # Now, when issuing the "CREATE TABLE" Spark SQL command, it fails because Hudi 
> tries to modify the "hoodie.properties" file, even though I'm not modifying the 
> table and am just trying to create the table in the catalog.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4209) TestHoodieBulkInsertDataInternalWriter fails due to MiniDFSCluster class not found

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4209:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> TestHoodieBulkInsertDataInternalWriter fails due to MiniDFSCluster class not 
> found
> --
>
> Key: HUDI-4209
> URL: https://issues.apache.org/jira/browse/HUDI-4209
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4036) Investigate whether meta fields could be omitted completely

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4036:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Investigate whether meta fields could be omitted completely
> ---
>
> Key: HUDI-4036
> URL: https://issues.apache.org/jira/browse/HUDI-4036
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
>
> Currently, even when meta fields are not populated, we still insert 
> empty-string columns to adhere to the expected schema.
> This has a non-trivial overhead of ~20% (relative to just writing the dataset 
> as is), since Spark has to essentially "re-write" the original row with the 
> prepended new fields.
> We should investigate whether it's feasible to avoid adding the empty-string 
> columns completely if meta fields are disabled.
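> A rough sketch of the idea, assuming an Avro schema and a populateMetaFields 
> flag (this is not the actual writer code path; `addMetadataFields` is the 
> existing helper that prepends the _hoodie_* columns):
> {code:java}
> import org.apache.avro.Schema;
> 
> import org.apache.hudi.avro.HoodieAvroUtils;
> 
> public class WriterSchemaSketch {
>   // Only prepend the _hoodie_* meta columns when meta fields are actually
>   // populated; otherwise keep the source schema untouched and skip the
>   // per-row re-write entirely.
>   static Schema resolveWriterSchema(Schema sourceSchema, boolean populateMetaFields) {
>     return populateMetaFields
>         ? HoodieAvroUtils.addMetadataFields(sourceSchema)
>         : sourceSchema;
>   }
> }
> {code}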



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4010) DynamoDB lock configs for naming/docs could be improved

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4010:
--
Status: Patch Available  (was: In Progress)

> DynamoDB lock configs for naming/docs could be improved
> ---
>
> Key: HUDI-4010
> URL: https://issues.apache.org/jira/browse/HUDI-4010
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> The issue was raised in [https://github.com/apache/hudi/issues/5451], where 
> the customer created the DynamoDB table manually and assumed the *partition_key* 
> property to be the column name to use, rather than the actual value stored in 
> the partition key column. The name of the partition key column is fixed in 
> code to {*}key{*}.
>  
> There is scope to improve the naming, so it is not confusing whether one 
> needs to provide a column name there or the column value to be used for that 
> specific table. Also, maybe we can better document that the table can be 
> created automatically by Hudi, and, if someone creates the table themselves, 
> what column name to use.
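> For illustration, a hedged example of how the settings are meant to be read 
> (the config key names are assumed from the DynamoDB lock provider; the lock 
> table's hash key attribute is always named `key`, and `partition_key` only 
> supplies the value written into it):
> {code:java}
> import java.util.Properties;
> 
> public class DynamoDbLockPropsSketch {
>   public static void main(String[] args) {
>     Properties props = new Properties();
>     // The table can be pre-created or auto-created by Hudi; its hash key
>     // attribute is named "key" regardless of what is configured below.
>     props.setProperty("hoodie.write.lock.dynamodb.table", "hudi-locks");
>     // NOT a column name: this is the value stored in the "key" attribute
>     // for this table's lock row.
>     props.setProperty("hoodie.write.lock.dynamodb.partition_key", "my_hudi_table_name");
>     props.setProperty("hoodie.write.lock.dynamodb.region", "us-east-2");
>     props.forEach((k, v) -> System.out.println(k + "=" + v));
>   }
> }
> {code}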



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4021) Support deferring compaction when there is an inflight delta commit

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4021:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Support deferring compaction when there is an inflight delta commit
> ---
>
> Key: HUDI-4021
> URL: https://issues.apache.org/jira/browse/HUDI-4021
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> In a multi-writer scenario, if there is an inflight delta commit by writer1 and 
> writer2 is looking to schedule compaction, the delta commit might 
> eventually fail. This is not a good experience for the end user. Think of a setup 
> where regular writes happen via deltastreamer and a separate async process 
> takes care of scheduling and executing compaction: deltastreamer 
> continuous mode might get aborted whenever such overstepping happens. We 
> should add support wherein compaction scheduling is deferred when an 
> inflight delta commit is found in the timeline, as sketched below. 
>  
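> A hedged sketch of the intended check (the timeline accessor names are 
> assumptions, not the final implementation):
> {code:java}
> import org.apache.hudi.common.table.HoodieTableMetaClient;
> import org.apache.hudi.common.table.timeline.HoodieTimeline;
> 
> public class DeferCompactionSketch {
>   // Returns true only when no delta commit is pending, so the scheduler can
>   // defer compaction instead of clashing with an ongoing write.
>   static boolean canScheduleCompaction(HoodieTableMetaClient metaClient) {
>     HoodieTimeline pendingDeltaCommits = metaClient.getActiveTimeline()
>         .getDeltaCommitTimeline()
>         .filterInflightsAndRequested();
>     return pendingDeltaCommits.empty();
>   }
> }
> {code}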



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3892:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Add HoodieReadClient with java
> --
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
>  Issue Type: Task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.0
>
>
> We might need a hoodie read client in Java similar to the one we have for 
> Spark. 
> [Apache Pulsar|https://github.com/apache/pulsar] is integrating with 
> Hudi, using Hudi as tiered storage to offload cold topic data. 
> When consumers fetch cold data from a topic, the Pulsar broker determines 
> whether the target data is stored in Pulsar or not. If the target data is 
> stored in tiered storage (Hudi), the Pulsar broker fetches it from Hudi via a 
> Java API, packages it into the Pulsar format, and dispatches it to the 
> consumer side.
> However, the current Hudi implementation doesn't support reading Hudi table 
> records via a Java API, so we couldn't read the target data out of Hudi into 
> the Pulsar broker, which blocks the Pulsar & Hudi integration.
> h3. What we need
>  # We need Hudi to support reading records via a Java API
>  # We need Hudi to support reading records out in writer order, or ordering 
> by specific fields (a strawman sketch follows this list).
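> As a strawman for discussion only, a purely hypothetical shape of such a 
> client (none of these types exist in Hudi today; the names are made up):
> {code:java}
> import java.util.Iterator;
> import java.util.List;
> 
> // Hypothetical read-side API: fetch records by key and scan in writer order.
> public interface HoodieJavaReadClient<T> {
>   // Point lookups by record key, e.g. for a Pulsar broker reading offloaded
>   // cold data back out of the Hudi table.
>   List<T> read(List<String> recordKeys);
> 
>   // Scan records between two instants, preserving writer (commit) order.
>   Iterator<T> readCommittedBetween(String startInstant, String endInstant);
> }
> {code}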



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3919) Support out of order rollback blocks in AbstractHoodieLogRecordReader

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3919:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Support out of order rollback blocks in AbstractHoodieLogRecordReader
> -
>
> Key: HUDI-3919
> URL: https://issues.apache.org/jira/browse/HUDI-3919
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently a rollback block refers to the previous data block for invalidation. 
> In a multi-writer scenario, the invalidated block need not be the immediately 
> preceding block, so we need to support out-of-order rollback blocks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3956) Optimize hudi-presto-bundle size

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3956:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Optimize hudi-presto-bundle size
> 
>
> Key: HUDI-3956
> URL: https://issues.apache.org/jira/browse/HUDI-3956
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0
>
>
> With the inclusion of hbase-server and its shading, the bundle size is about 
> 33mb compared to 17mb previously. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3882) Make sure Hudi Spark relations implementations provide similar file-scanning metrics

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3882:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Make sure Hudi Spark relations implementations provide similar file-scanning 
> metrics
> 
>
> Key: HUDI-3882
> URL: https://issues.apache.org/jira/browse/HUDI-3882
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
>
> After rebasing Hudi's COW table onto `BaseFileOnlyRelation` from 
> `HadoopFsRelation`, we lost very helpful file-scanning metrics like # of 
> files scanned, total size of files scanned, etc.
>  
> After some investigation, I've found that this occurred because, instead of 
> relying on the `FileScan` node in Spark plans as before, we now get the 
> `DataScan` node, which doesn't provide such metrics.
> This unwanted transition occurred because Spark internally predicates on 
> `HadoopFsRelation` to decide whether it's a `FileScan` or a `DataScan`, and 
> since we stopped using `HadoopFsRelation`, Hudi relations now fall into the 
> latter bucket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3881) Implement index syntax for spark sql

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3881:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Implement index syntax for spark sql
> 
>
> Key: HUDI-3881
> URL: https://issues.apache.org/jira/browse/HUDI-3881
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> {code:java}
> 1. Create Index
> CREATE INDEX [IF NOT EXISTS] index_name
> ON [TABLE] [db_name.]table_name (column_name [ASC|DESC], ...) 
> USING [bloom/lucene]
> [PROPERTIES ('key'='value')] 
> 2. Refresh Index
> REFRESH INDEX index_name ON [TABLE] [db_name.]table_name
> 3. Drop Index 
> DROP INDEX [IF EXISTS] index_name ON [TABLE] [db_name.]table_name
> 4. Show index
> SHOW INDEX [index_name] ON [TABLE] [db_name.]table_name{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3775:
--
Fix Version/s: (was: 0.12.1)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: easyfix
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction but Spark 
> Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-01 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1264348973

   
   ## CI report:
   
   * e3aef767db19eed24222f8fff89ae4c59d0799c2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11956)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3796) Implement layout to filter out uncommitted log files without reading the log blocks

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3796:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Implement layout to filter out uncommitted log files without reading the log 
> blocks
> ---
>
> Key: HUDI-3796
> URL: https://issues.apache.org/jira/browse/HUDI-3796
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.0
>
>
> Related: HUDI-3637
> At a high level, getLatestFileSlices() fetches the latest file slices 
> for committed base files and filters out any file slices with an uncommitted 
> base instant time.  Uncommitted log files in the latest file slices may still 
> be included; they are skipped while doing log reading and merging, i.e., in 
> the logic of "AbstractHoodieLogRecordReader".
> We can use the log instant time instead of the base instant time in the log 
> file name, so that uncommitted log files can be filtered out without reading 
> the log blocks beforehand.
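> A hedged sketch of the filtering this naming would enable (the extractor that 
> parses the instant out of the log file name is hypothetical):
> {code:java}
> import java.util.List;
> import java.util.Set;
> import java.util.function.Function;
> import java.util.stream.Collectors;
> 
> public class UncommittedLogFileFilterSketch {
>   // With the log's own instant time encoded in its file name, uncommitted log
>   // files can be dropped with a simple set lookup, without reading any log blocks.
>   static List<String> filterCommitted(List<String> logFileNames,
>                                       Set<String> completedInstants,
>                                       Function<String, String> instantTimeOfLogFile) {
>     return logFileNames.stream()
>         .filter(name -> completedInstants.contains(instantTimeOfLogFile.apply(name)))
>         .collect(Collectors.toList());
>   }
> }
> {code}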



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3818) hudi doesn't support bytes column as primary key

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3818:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> hudi doesn't support bytes column as primary key
> 
>
> Key: HUDI-3818
> URL: https://issues.apache.org/jira/browse/HUDI-3818
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: rex xiong
>Assignee: rex xiong
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
>  When using a bytes column as the primary key, hudi will generate a fixed 
> hoodie key, so upserts will only ever insert one row (a possible fix is 
> sketched after the example below). 
> {code:java}
> scala> sql("desc extended binary_test1").show()
> +++---+
> |            col_name|           data_type|comment|
> +++---+
> | _hoodie_commit_time|              string|   null|
> |_hoodie_commit_seqno|              string|   null|
> |  _hoodie_record_key|              string|   null|
> |_hoodie_partition...|              string|   null|
> |   _hoodie_file_name|              string|   null|
> |                  id|              binary|   null|
> |                name|              string|   null|
> |                  dt|              string|   null|
> |                    |                    |       |
> |# Detailed Table ...|                    |       |
> |            Database|             default|       |
> |               Table|        binary_test1|       |
> |               Owner|                root|       |
> |        Created Time|Sat Apr 02 13:28:...|       |
> |         Last Access|             UNKNOWN|       |
> |          Created By|         Spark 3.2.0|       |
> |                Type|             MANAGED|       |
> |            Provider|                hudi|       |
> |    Table Properties|[last_commit_time...|       |
> |          Statistics|        435194 bytes|       |
> +++---+
> scala> sql("select * from binary_test1").show()
> +---+++--+++-++
> |_hoodie_commit_time|_hoodie_commit_seqno|  
> _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|               
>    id|     name|      dt|
> +---+++--+++-++
> |  20220402132927590|20220402132927590...|id:java.nio.HeapB...|               
>        |1a06106e-5e7a-4e6...|[03 45 6A 00 00 0...|Mary Jane|20220401|
> +---+++--+++-++{code}
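> One possible direction, as a hedged sketch only (not the actual KeyGenerator 
> change): derive the key from the byte contents rather than the buffer's 
> toString(), which is what produces the shared "java.nio.HeapB..." key above.
> {code:java}
> import java.nio.ByteBuffer;
> import java.util.Base64;
> 
> public class BytesKeySketch {
>   // Encode the byte contents deterministically so each distinct binary id
>   // yields a distinct record key.
>   static String recordKeyFor(ByteBuffer idValue) {
>     ByteBuffer copy = idValue.asReadOnlyBuffer();
>     byte[] bytes = new byte[copy.remaining()];
>     copy.get(bytes);
>     return "id:" + Base64.getEncoder().encodeToString(bytes);
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(recordKeyFor(ByteBuffer.wrap(new byte[]{0x03, 0x45, 0x6A})));
>   }
> }
> {code}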



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3775:
--
Fix Version/s: 0.13.0

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: easyfix
> Fix For: 0.13.0
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction but Spark 
> Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.
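> A hedged sketch of the decoupling being asked for (the option keys are 
> assumptions and are shown as plain writer options rather than a full Spark 
> Streaming job):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> public class StreamingCompactionOptionsSketch {
>   // Ingestion job: turn off inline/async compaction so the streaming cluster
>   // is sized for writes only; a separate process (e.g. an offline compactor
>   // from hudi-utilities) executes compaction with its own resources.
>   static Map<String, String> ingestOnlyWriterOptions() {
>     Map<String, String> opts = new HashMap<>();
>     opts.put("hoodie.compact.inline", "false");
>     opts.put("hoodie.datasource.compaction.async.enable", "false");
>     // Still allow compaction plans to be scheduled for the offline job (assumption).
>     opts.put("hoodie.compact.schedule.inline", "true");
>     return opts;
>   }
> }
> {code}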



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

