Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
yihua commented on code in PR #9879: URL: https://github.com/apache/hudi/pull/9879#discussion_r1363329412 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -241,7 +241,12 @@ private void scanInternalV1(Option keySpecOpt) { try { // Iterate over the paths logFormatReaderWrapper = new HoodieLogFormatReader(fs, - logFilePaths.stream().map(logFile -> new HoodieLogFile(new CachingPath(logFile))).collect(Collectors.toList()), + logFilePaths.stream() Review Comment: Should the same filtering logic be wrapped into a util method given it is also used in `scanInternalV2`? ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -241,7 +241,12 @@ private void scanInternalV1(Option keySpecOpt) { try { // Iterate over the paths logFormatReaderWrapper = new HoodieLogFormatReader(fs, - logFilePaths.stream().map(logFile -> new HoodieLogFile(new CachingPath(logFile))).collect(Collectors.toList()), + logFilePaths.stream() + .map(filePath -> new HoodieLogFile(new CachingPath(filePath))) + // hit an uncommitted file possibly from a failed write, skip processing this one + .filter(logFile -> completedInstantsTimeline.containsOrBeforeTimelineStarts(logFile.getDeltaCommitTime()) Review Comment: Should this logic be dependent on table version? ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -269,11 +274,6 @@ private void scanInternalV1(Option keySpecOpt) { break; } if (logBlock.getBlockType() != CORRUPT_BLOCK && logBlock.getBlockType() != COMMAND_BLOCK) { - if (!completedInstantsTimeline.containsOrBeforeTimelineStarts(instantTime) Review Comment: Similar here wrt table version. Before log file name change, this logic is still needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
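One way to factor that check out, as the first comment suggests, is a small helper along these lines (a sketch only: the method name is hypothetical, it assumes it lives inside `AbstractHoodieLogRecordReader` where `completedInstantsTimeline` and the referenced types are already available, and it simply mirrors the filter shown in the diff so that `scanInternalV1` and `scanInternalV2` could share it):

```java
// Hypothetical helper mirroring the filter in the diff above: keep only log files whose
// delta commit is already part of the completed timeline, skipping uncommitted files
// possibly left behind by failed writes.
private List<HoodieLogFile> filterCommittedLogFiles(List<String> logFilePaths) {
  return logFilePaths.stream()
      .map(filePath -> new HoodieLogFile(new CachingPath(filePath)))
      .filter(logFile -> completedInstantsTimeline.containsOrBeforeTimelineStarts(
          logFile.getDeltaCommitTime()))
      .collect(Collectors.toList());
}
```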
Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]
hudi-bot commented on PR #6983: URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767760342 ## CI report: * 003721a9e975415951aed2725a744b29f87cacc1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20377) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 9b7e539a902 [MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813) 9b7e539a902 is described below commit 9b7e539a902cb2cf594a799957af260cf00ab8b4 Author: Tim Brown AuthorDate: Wed Oct 18 01:21:51 2023 -0500 [MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813) --- .../org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java index 0d821ffe103..27314f150dc 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java @@ -322,8 +322,7 @@ public class HFileBootstrapIndex extends BootstrapIndex { @Override public List getSourceFileMappingForPartition(String partition) { - try { -HFileScanner scanner = partitionIndexReader().getScanner(true, false); + try (HFileScanner scanner = partitionIndexReader().getScanner(true, false)) { KeyValue keyValue = new KeyValue(Bytes.toBytes(getPartitionKey(partition)), new byte[0], new byte[0], HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put, new byte[0]); if (scanner.seekTo(keyValue) == 0) { @@ -355,8 +354,7 @@ public class HFileBootstrapIndex extends BootstrapIndex { // Arrange input Keys in sorted order for 1 pass scan List fileGroupIds = new ArrayList<>(ids); Collections.sort(fileGroupIds); - try { -HFileScanner scanner = fileIdIndexReader().getScanner(true, false); + try (HFileScanner scanner = fileIdIndexReader().getScanner(true, false)) { for (HoodieFileGroupId fileGroupId : fileGroupIds) { KeyValue keyValue = new KeyValue(Bytes.toBytes(getFileGroupKey(fileGroupId)), new byte[0], new byte[0], HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put, new byte[0]);
Re: [PR] [MINOR] HFileBootstrapIndex: use try-with-resources in two places [hudi]
yihua merged PR #9813: URL: https://github.com/apache/hudi/pull/9813 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
hudi-bot commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767680714 ## CI report: * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20374) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Trino queries failing when hudi.metadata_enabled is set to true. [hudi]
BalaMahesh commented on issue #9758: URL: https://github.com/apache/hudi/issues/9758#issuecomment-1767671730 @ad1happy2go / @codope - With hoodie.metadata.compact.max.delta.commits=1 I expect compaction to run after every metadata commit, but that is not happening: the delta commits keep piling up on the metadata path, and only after a large number of files accumulate is compaction triggered, which then fails with an OOM error. I didn't get enough time to go through the code flow for the metadata table. We can increase the memory fraction, but if that were the problem, compaction should at least be triggered after every delta commit and fail right away with hoodie.metadata.compact.max.delta.commits set to 1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
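For reference, a minimal sketch of how the settings under discussion would typically be supplied as writer properties (illustrative values only; this shows the configuration being described, not a fix):

```java
import java.util.Properties;

// Illustrative only: enable the metadata table and request metadata-table compaction
// after every delta commit, matching the expectation described above.
Properties writerProps = new Properties();
writerProps.setProperty("hoodie.metadata.enable", "true");
writerProps.setProperty("hoodie.metadata.compact.max.delta.commits", "1");
```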
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
zhuanshenbsj1 commented on code in PR #9878: URL: https://github.com/apache/hudi/pull/9878#discussion_r1363170585 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java: ## @@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) { // bootstrap final DataStream hoodieRecordDataStream = Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite); + // write pipeline pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream); - // compaction + + // insert cluster mode + if (OptionsResolver.isInsertClusterMode(conf)) { +return Pipelines.clean(conf, pipeline); + } + + // upsert mode if (OptionsResolver.needsAsyncCompaction(conf)) { // use synchronous compaction for bounded source. if (context.isBounded()) { conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false); } return Pipelines.compact(conf, pipeline); - } else { + } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) { return Pipelines.clean(conf, pipeline); + } else { +return Pipelines.dummySink(pipeline); Review Comment: Similar to clustering, cleaning is performed wherever merging is performed (inline or offline). ``` if (OptionsResolver.needsAsyncCompaction(conf)) { // 1 // use synchronous compaction for bounded source. if (context.isBounded()) { conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false); } return Pipelines.compact(conf, pipeline); } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) { //2.1 return Pipelines.clean(conf, pipeline); } else { //2.2 return Pipelines.dummySink(pipeline); } ``` 1. If Flink online asynchronous merge execution is turned on, the cluster/compactor commit operator will perform the clean. 2. If Flink online asynchronous merge execution is turned off, there are two situations: 2.1 If lazy cleaning is enabled, the clean operator must be added to handle rollback. 2.2 If lazy cleaning is disabled, there is no need to consider rollback; clean will be called after the offline task execution completes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767633844 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 8826dfc2e2487c43703787c737d8143c6bb7285a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20373) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]
hudi-bot commented on PR #9053: URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767633129 ## CI report: * ff5cd07154d48f18d8034075c8dfc3990b204cbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18532) * 8caf378576e4c7e68cdd32d1e24d89afc05b056b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20379) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5220] fix hive snapshot query add non hoodie paths file status [hudi]
hudi-bot commented on PR #7206: URL: https://github.com/apache/hudi/pull/7206#issuecomment-1767631379 ## CI report: * 5d7a1c4824c100a48c95e3d017822aa1062ad8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20258) * 9d653e9325beb6e3391607d073dfa8c030ee798f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20378) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]
hudi-bot commented on PR #6983: URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767631150 ## CI report: * 0593cd212628684db658d7a8bdd8fc320069d090 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20225) * 003721a9e975415951aed2725a744b29f87cacc1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20377) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]
hudi-bot commented on PR #9053: URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767625108 ## CI report: * ff5cd07154d48f18d8034075c8dfc3990b204cbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18532) * 8caf378576e4c7e68cdd32d1e24d89afc05b056b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5220] fix hive snapshot query add non hoodie paths file status [hudi]
hudi-bot commented on PR #7206: URL: https://github.com/apache/hudi/pull/7206#issuecomment-1767623659 ## CI report: * 5d7a1c4824c100a48c95e3d017822aa1062ad8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20258) * 9d653e9325beb6e3391607d073dfa8c030ee798f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]
hudi-bot commented on PR #6983: URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767623459 ## CI report: * 0593cd212628684db658d7a8bdd8fc320069d090 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20225) * 003721a9e975415951aed2725a744b29f87cacc1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
hudi-bot commented on PR #9879: URL: https://github.com/apache/hudi/pull/9879#issuecomment-1767619953 ## CI report: * aa997ac209a57ace18f76bdc5fa602d0bead8345 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20376) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
hudi-bot commented on PR #9878: URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767619917 ## CI report: * 4eb8e5387cb728bde662a45f77062bd574c6cff0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20375) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767619892 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
hudi-bot commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767619726 ## CI report: * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363) * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20374) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
danny0405 commented on code in PR #9878: URL: https://github.com/apache/hudi/pull/9878#discussion_r1363158192 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java: ## @@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) { // bootstrap final DataStream hoodieRecordDataStream = Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite); + // write pipeline pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream); - // compaction + + // insert cluster mode + if (OptionsResolver.isInsertClusterMode(conf)) { +return Pipelines.clean(conf, pipeline); + } + + // upsert mode if (OptionsResolver.needsAsyncCompaction(conf)) { // use synchronous compaction for bounded source. if (context.isBounded()) { conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false); } return Pipelines.compact(conf, pipeline); - } else { + } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) { return Pipelines.clean(conf, pipeline); + } else { +return Pipelines.dummySink(pipeline); Review Comment: Sorry, I didn't get why you remove the clean operators then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
danny0405 commented on code in PR #9743: URL: https://github.com/apache/hudi/pull/9743#discussion_r1363138608 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java: ## @@ -206,6 +213,9 @@ public IndexedRecord next() { IndexedRecord record = this.reader.read(null, decoder); this.dis.skipBytes(recordLength); this.readRecords++; +if (this.promotedSchema.isPresent()) { + return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get()); Review Comment: Yeah, if there are some data types that require a rewrite then keeping it as it is might be good now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]
bvaradar commented on PR #9053: URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767610090 Looks good to me. Resurrecting this PR with rebase and minor test class rename. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
hudi-bot commented on PR #9879: URL: https://github.com/apache/hudi/pull/9879#issuecomment-1767581232 ## CI report: * aa997ac209a57ace18f76bdc5fa602d0bead8345 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
hudi-bot commented on PR #9878: URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767581203 ## CI report: * 4eb8e5387cb728bde662a45f77062bd574c6cff0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
hudi-bot commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767581062 ## CI report: * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363) * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767581022 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371) * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 8826dfc2e2487c43703787c737d8143c6bb7285a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20373) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767573921 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371) * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 8826dfc2e2487c43703787c737d8143c6bb7285a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
zhuanshenbsj1 commented on PR #9878: URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767563266 cc @danny0405 @yihua -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]
TengHuo commented on PR #9413: URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767562595 > In my company, I also encountered a situation where the entire table directory was deleted Hi @wqlsdb, would you like to discuss it offline or over email? We encountered this issue multiple times internally, and we are trying to find the root cause. I think it could be helpful if we can sync on some common information. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]
danny0405 commented on PR #9413: URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767562306 @wqlsdb, would you mind cherry-picking this fix into your local repo? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6952) Skip reading the uncommitted log files for log reader
[ https://issues.apache.org/jira/browse/HUDI-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6952: - Labels: pull-request-available (was: ) > Skip reading the uncommitted log files for log reader > - > > Key: HUDI-6952 > URL: https://issues.apache.org/jira/browse/HUDI-6952 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
danny0405 opened a new pull request, #9879: URL: https://github.com/apache/hudi/pull/9879 ### Change Logs This is to avoid potential exceptions when the reader is processing an uncommitted log file while the cleaning or rollback service removes the log file. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]
wqlsdb commented on PR #9413: URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767559647 In my company, I also encountered a situation where the entire table directory was deleted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
zhuanshenbsj1 opened a new pull request, #9878: URL: https://github.com/apache/hudi/pull/9878 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Dirty data filtering failed [hudi]
deasea opened a new issue, #9877: URL: https://github.com/apache/hudi/issues/9877 Hi, I encountered an exception: ![image](https://github.com/apache/hudi/assets/35282893/cb4c90c0-f62a-4799-b598-a7d7348ae293) I want to filter out this dirty data when ingesting into the lake. I tried 2 parameters; neither worked: write.ignore.failed: true and hoodie.datasource.write.streaming.ignore.failed.batch: true. Flink 1.13.5, Hudi 0.13/0.10. How should we skip dirty data in this scenario? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767533835 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371) * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767534035 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767526069 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767518410 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6953) Optimizing hudi sink operators generation
zhuanshenbsj1 created HUDI-6953: --- Summary: Optimizing hudi sink operators generation Key: HUDI-6953 URL: https://issues.apache.org/jira/browse/HUDI-6953 Project: Apache Hudi Issue Type: Improvement Reporter: zhuanshenbsj1 Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
stream2000 commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767491739 > could you write a test or provide a some sample code to trigger this issue? I'm a little unclear if this is solving a race condition or something else @jonvex We can trigger the issue by the following code( not stable) : ```scala test("Test concurrent overwrite") { withTempDir { tmp => import spark.implicits._ val day = "2021-08-02" val threadCount = 12 val df = Seq((1, "a1", 10, 1000, day, 12)).toDF("id", "name", "value", "ts", "day", "hh") val executors = Executors.newFixedThreadPool(threadCount) var futures: Array[Future[_]] = new Array(threadCount /2 ) for (i <- 0 until threadCount / 2 ) { val overwriteTask = new Runnable { override def run(): Unit = { val tableName = "table_name" + i val tablePath = s"${tmp.getCanonicalPath}/$tableName" // Write a table by spark dataframe. df.write.format("hudi") .option(HoodieWriteConfig.TBL_NAME.key, tableName) .option(TABLE_TYPE.key, MOR_TABLE_TYPE_OPT_VAL) .option(RECORDKEY_FIELD.key, "id") .option(PRECOMBINE_FIELD.key, "ts") .option(PARTITIONPATH_FIELD.key, "day,hh") .option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "1") .option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "1") .option(HoodieWriteConfig.ALLOW_OPERATION_METADATA_FIELD.key, "true") .mode(SaveMode.Overwrite) .save(tablePath) } } futures(i) = executors.submit(overwriteTask) } futures.foreach(f => f.get()) futures = new Array(threadCount) for (i <- 0 until threadCount) { val overwriteTask = new Runnable { override def run(): Unit = { val tableName = "table_name" + (12 - i) val tablePath = s"${tmp.getCanonicalPath}/$tableName" // Write a table by spark dataframe. df.write.format("hudi") .option(HoodieWriteConfig.TBL_NAME.key, tableName) .option(TABLE_TYPE.key, MOR_TABLE_TYPE_OPT_VAL) .option(RECORDKEY_FIELD.key, "id") .option(PRECOMBINE_FIELD.key, "ts") .option(PARTITIONPATH_FIELD.key, "day,hh") .option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "1") .option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "1") .option(HoodieWriteConfig.ALLOW_OPERATION_METADATA_FIELD.key, "true") .mode(SaveMode.Append) .save(tablePath) } } futures(i) = executors.submit(overwriteTask) } futures.foreach(f => f.get()) } } ``` And we will get exception stack trace sometimes like this: ```txt Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path /private/var/folders/q1/_zbtr5t97rz27jb_f3ph8chmgp/T/spark-d9e6236f-4d31-4ea1-a60a-df21c5d1d545/table_name12/.hoodie at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:57) at org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:149) at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:735) at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:91) at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:826) at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$getHoodieTableConfig$1(HoodieSparkSqlWriter.scala:1165) at scala.Option.getOrElse(Option.scala:189) at org.apache.hudi.HoodieSparkSqlWriter$.getHoodieTableConfig(HoodieSparkSqlWriter.scala:1166) at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:172) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:133) ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6653) Support position-based merging of base and log files
[ https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6653: --- Assignee: Lin Liu (was: Ethan Guo) > Support position-based merging of base and log files > > > Key: HUDI-6653 > URL: https://issues.apache.org/jira/browse/HUDI-6653 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6653) Support position-based merging of base and log files
[ https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6653: Status: In Progress (was: Open) > Support position-based merging of base and log files > > > Key: HUDI-6653 > URL: https://issues.apache.org/jira/browse/HUDI-6653 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6653) Support position-based merging of base and log files
[ https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6653: Status: Patch Available (was: In Progress) > Support position-based merging of base and log files > > > Key: HUDI-6653 > URL: https://issues.apache.org/jira/browse/HUDI-6653 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6798) Implement event-time-based merging mode in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6798: --- Assignee: Ethan Guo > Implement event-time-based merging mode in FileGroupReader > -- > > Key: HUDI-6798 > URL: https://issues.apache.org/jira/browse/HUDI-6798 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6788) Integrate FileGroupReader with MergeOnReadInputFormat for Flink
[ https://issues.apache.org/jira/browse/HUDI-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6788: --- Assignee: (was: Ethan Guo) > Integrate FileGroupReader with MergeOnReadInputFormat for Flink > --- > > Key: HUDI-6788 > URL: https://issues.apache.org/jira/browse/HUDI-6788 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6787: --- Assignee: (was: Ethan Guo) > Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and > RealtimeCompactedRecordReader for Hive > - > > Key: HUDI-6787 > URL: https://issues.apache.org/jira/browse/HUDI-6787 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6949) Spark support non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhang reassigned HUDI-6949: Assignee: Jing Zhang > Spark support non-blocking concurrency control > -- > > Key: HUDI-6949 > URL: https://issues.apache.org/jira/browse/HUDI-6949 > Project: Apache Hudi > Issue Type: New Feature > Components: spark, spark-sql >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6801) Implement merging of partial updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6801: Status: In Progress (was: Open) > Implement merging of partial updates in FileGroupReader > --- > > Key: HUDI-6801 > URL: https://issues.apache.org/jira/browse/HUDI-6801 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6800: Status: Patch Available (was: In Progress) > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6800: Status: In Progress (was: Open) > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6800: - Labels: pull-request-available (was: ) > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua opened a new pull request, #9876: URL: https://github.com/apache/hudi/pull/9876 ### Change Logs This PR adds the functionality to write partial updates to the data blocks in MOR tables, for Spark SQL MERGE INTO. ### Impact Reduces write amplification ### Risk level medium ### Documentation Update New feature docs ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767481158 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369) * bc45850e7a2962242d4e99e88b07c89b8c8e19bf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6952) Skip reading the uncommitted log files for log reader
Danny Chen created HUDI-6952: Summary: Skip reading the uncommitted log files for log reader Key: HUDI-6952 URL: https://issues.apache.org/jira/browse/HUDI-6952 Project: Apache Hudi Issue Type: Improvement Components: reader-core Reporter: Danny Chen Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6801) Implement merging of partial updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6801: --- Assignee: Ethan Guo > Implement merging of partial updates in FileGroupReader > --- > > Key: HUDI-6801 > URL: https://issues.apache.org/jira/browse/HUDI-6801 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6950. Resolution: Fixed Fixed via master branch: fae20cd12a0057c8dda7f302699f65a2fe335d0a > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.14.0 >Reporter: xy >Assignee: xy >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg, dump.jpg, > fix_stages.jpg, oom_stages.jpg > > > Currently, a table with many partitions can easily cause an OOM. > e.g.: > CREATE TABLE hudi_test.tmp_hudi_test_1 ( > id string, > name string, > dt bigint, > day STRING COMMENT 'date partition', > hour INT COMMENT 'hour partition' > ) using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField' = 'dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets' = 512 > ) > PARTITIONED BY (day, hour); > select count(1) from hudi_test.tmp_hudi_test_1 where > day='2023-10-17' would list a large number of file statuses onto the driver, and the driver would > OOM (e.g. a table with hundreds of billions of records in the > partition day='2023-10-17'). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition (#9875)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new fae20cd12a0 [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition (#9875) fae20cd12a0 is described below commit fae20cd12a0057c8dda7f302699f65a2fe335d0a Author: xuzifu666 AuthorDate: Wed Oct 18 08:40:03 2023 +0800 [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition (#9875) --- .../metadata/FileSystemBackedTableMetadata.java| 95 -- 1 file changed, 54 insertions(+), 41 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java index f4cd7c29074..3737793e0c6 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java @@ -54,7 +54,6 @@ import java.util.List; import java.util.Map; import java.util.concurrent.CopyOnWriteArrayList; import java.util.stream.Collectors; -import java.util.stream.Stream; /** * Implementation of {@link HoodieTableMetadata} based file-system-backed table metadata. @@ -157,52 +156,66 @@ public class FileSystemBackedTableMetadata extends AbstractHoodieTableMetadata { // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. 
- if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileLis
Re: [PR] [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition [hudi]
danny0405 merged PR #9875: URL: https://github.com/apache/hudi/pull/9875 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]
arunvasudevan commented on issue #9861: URL: https://github.com/apache/hudi/issues/9861#issuecomment-1767272065 Yes, checked the archive folder and it is empty in this case. Here are the writer configurations:
hoodie.datasource.hive_sync.database:
hoodie.datasource.hive_sync.mode: HMS
hoodie.datasource.write.precombine.field: source_ts_ms
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.max.file.size: 67108864
hoodie.datasource.meta.sync.enable: true
hoodie.datasource.hive_sync.skip_ro_suffix: true
hoodie.metadata.enable: false
hoodie.datasource.hive_sync.table:
hoodie.index.type: SIMPLE
hoodie.clean.automatic: true
hoodie.datasource.write.operation: upsert
hoodie.metrics.reporter.type: CLOUDWATCH
hoodie.datasource.hive_sync.enable: true
hoodie.datasource.write.recordkey.field: version_id
hoodie.table.name: ride_version
hoodie.datasource.hive_sync.jdbcurl: jdbc:hive2://ip-:1
hoodie.datasource.write.table.type: MERGE_ON_READ
hoodie.simple.index.parallelism: 240
hoodie.write.lock.dynamodb.partition_key:
hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
hoodie.compact.inline: true
hoodie.client.heartbeat.interval_in_ms: 60
hoodie.datasource.compaction.async.enable: true
hoodie.metrics.on: true
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.cleaner.policy.failed.writes: LAZY
hoodie.keep.max.commits: 1650
hoodie.cleaner.hours.retained: 168
hoodie.write.lock.dynamodb.table: peloton-prod-hudi-write-lock
hoodie.write.lock.provider: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.keep.min.commits: 1600
hoodie.datasource.write.partitionpath.field:
hoodie.compact.inline.max.delta.commits: 1
hoodie.write.concurrency.mode: optimistic_concurrency_control
hoodie.write.lock.dynamodb.region: us-east-1
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
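For context on how a configuration like the one above is typically applied: if these options are passed through the Spark datasource writer, a minimal upsert call could look like the sketch below. This is hypothetical; the input path, base path, and the subset of options shown are placeholders taken from the list above, not the reporter's actual job, and the remaining cleaner/compaction/lock options would be appended the same way.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiUpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate();

    // Placeholder source; in practice this would be the incoming batch to upsert.
    Dataset<Row> batch = spark.read().format("parquet").load("/tmp/incoming_batch");

    batch.write().format("hudi")
        .option("hoodie.table.name", "ride_version")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.recordkey.field", "version_id")
        .option("hoodie.datasource.write.precombine.field", "source_ts_ms")
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
        .option("hoodie.index.type", "SIMPLE")
        .option("hoodie.compact.inline", "true")
        .option("hoodie.compact.inline.max.delta.commits", "1")
        .option("hoodie.cleaner.policy", "KEEP_LATEST_BY_HOURS")
        .option("hoodie.cleaner.hours.retained", "168")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/ride_version"); // placeholder base path
  }
}
```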
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767246606 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu reassigned HUDI-6793: - Assignee: Lin Liu (was: Jonathan Vexler) > Support time-travel read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6793 > URL: https://issues.apache.org/jira/browse/HUDI-6793 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6790: -- Status: In Progress (was: Open) > Support incremental read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6790 > URL: https://issues.apache.org/jira/browse/HUDI-6790 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu reassigned HUDI-6790: - Assignee: Lin Liu (was: Jonathan Vexler) > Support incremental read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6790 > URL: https://issues.apache.org/jira/browse/HUDI-6790 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767225223 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * 66df70555e5fc284eeedb1fdbfecbc141b03678a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20368) * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767156435 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * ce8a55919d455f5582a0aa18069d57cbd645e37b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20362) * 66df70555e5fc284eeedb1fdbfecbc141b03678a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20368) * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767139167 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * ce8a55919d455f5582a0aa18069d57cbd645e37b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20362) * 66df70555e5fc284eeedb1fdbfecbc141b03678a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
lokesh-lingarajan-0310 commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767096533 @the-other-tim-brown - What would the behavior be when this is false and schema evolution is enabled? Is there an option where it would auto-drop the column in the target table? Reply - Currently the plan is to support all these evolutions OOB and not rely on the schema evolution flags. The idea of the delete flag here is that we are changing the default behavior for deleted columns OOB, so to not break backward compatibility we will start with false as the default, and after a couple of releases we will put up a warning note and make it the default behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
lokesh-lingarajan-0310 commented on code in PR #9743: URL: https://github.com/apache/hudi/pull/9743#discussion_r1362693693 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ## @@ -538,6 +538,8 @@ object DataSourceWriteOptions { val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = HoodieCommonConfig.RECONCILE_SCHEMA + val ADD_NULL_FOR_DELETED_COLUMNS: ConfigProperty[String] = HoodieCommonConfig.ADD_NULL_FOR_DELETED_COLUMNS Review Comment: For now we will keep it turned off so that we don't break backward compatibility. If some OSS users are relying on the stream failing when columns are deleted, then making this the default evolution for deleted columns would break those pipelines. Agree that default true is better, but maybe after a couple of releases we can make that change IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
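As a rough illustration of what "add null for deleted columns" means for the data, here is a sketch in plain Avro (not the actual StreamSync/DeltaStreamer code path): when the incoming batch stops sending a column that the table schema still carries, the record is rewritten against the table schema with the missing column set to null instead of failing the pipeline. The schemas and field names below are made up for the example.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AddNullForDeletedColumnSketch {
  public static void main(String[] args) {
    // Table schema still carries "city"; the incoming batch dropped it.
    Schema tableSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id")
        .optionalString("city") // union(null, string), so null is a legal fill-in
        .endRecord();
    Schema incomingSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id")
        .endRecord();

    GenericRecord incoming = new GenericData.Record(incomingSchema);
    incoming.put("id", "k1");

    // Rewrite the incoming record against the table schema, nulling any column
    // the source no longer sends instead of failing the stream.
    GenericRecord rewritten = new GenericData.Record(tableSchema);
    for (Schema.Field field : tableSchema.getFields()) {
      Object value = incomingSchema.getField(field.name()) == null
          ? null
          : incoming.get(field.name());
      rewritten.put(field.name(), value);
    }
    System.out.println(rewritten); // {"id": "k1", "city": null}
  }
}
```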
Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]
Jason-liujc commented on issue #7653: URL: https://github.com/apache/hudi/issues/7653#issuecomment-1766886802 Can't speak to what the official guidance from Hudi is at the moment (seems like they will roll out the non-blocking concurrent write feature in version 1.0+). We had to increase `yarn.resourcemanager.am.max-attempts` and `spark.yarn.maxAppAttempts` (the Spark-specific config) to make it retry more, and reorganize our tables to reduce concurrent writes. Any other lock provider wasn't an option for us since we are running different jobs from different clusters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
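For reference, the multi-writer setup under discussion is usually expressed through write options like the following sketch; the lock table, partition key, and region values are placeholders, and the config keys mirror the ones shown in the writer configuration earlier in this thread. Even with a lock provider in place, conflicting commits can still surface as write conflicts that have to be retried, which is why the comment above falls back to AM/application-level retries.

```java
import java.util.HashMap;
import java.util.Map;

public class MultiWriterOptionsSketch {
  // Options a multi-writer job might add on top of its normal Hudi write config.
  public static Map<String, String> occOptions() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
    opts.put("hoodie.write.lock.provider",
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider");
    opts.put("hoodie.write.lock.dynamodb.table", "my-hudi-write-lock"); // placeholder
    opts.put("hoodie.write.lock.dynamodb.partition_key", "my_table");   // placeholder
    opts.put("hoodie.write.lock.dynamodb.region", "us-east-1");         // placeholder
    return opts;
  }
}
```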
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
lokesh-lingarajan-0310 commented on code in PR #9743: URL: https://github.com/apache/hudi/pull/9743#discussion_r1362473548 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -661,6 +652,35 @@ private Pair>> fetchFromSourc return Pair.of(schemaProvider, Pair.of(checkpointStr, records)); } + /** + * Apply schema reconcile and schema evolution rules(schema on read) and generate new target schema provider. + * + * @param incomingSchema schema of the source data + * @param sourceSchemaProvider Source schema provider. + * @return the SchemaProvider that can be used as writer schema. + */ + private SchemaProvider getDeducedSchemaProvider(Schema incomingSchema, SchemaProvider sourceSchemaProvider) { Review Comment: This function just picks up the latest table schema for writing in case the schema provider is set to a NULL schema. All of the evolution is handled in the getDeducedSchemaProvider API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6800: --- Assignee: Ethan Guo > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Y Ethan Guo reassigned HUDI-6800: - Assignee: Y Ethan Guo > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Y Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Y Ethan Guo reassigned HUDI-6800: - Assignee: (was: Y Ethan Guo) > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6950: - Attachment: dump.jpg > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.14.0 >Reporter: xy >Assignee: xy >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg, dump.jpg, > fix_stages.jpg, oom_stages.jpg > > > currently if multiple partition table,would cause oom easy > eg: > CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} ( > {{id}} string, > {{name}} string, > {{dt}} bigint, > {{day}} STRING COMMENT '日期分区', > {{hour}} INT COMMENT '小时分区' > )using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY ({{{}day{}}},{{{}hour{}}}); > select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where > day='2023-10-17' would list much filestatus to driver,and driver would > oom(such as table with hundreds billion records in a > partition(day='2023-10-17')) -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766677931 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Apache Flink 1.16 with Apache Hudi [hudi]
mintesnot77 commented on issue #9777: URL: https://github.com/apache/hudi/issues/9777#issuecomment-175756 hello i need your help i am new for hudi i try to integrate apache hudi with my hadoop cluster (1 master and 3 slaves but master one also act as slaves) but i dont understand how to configure the hudi conf i just give the basepath hdfs:my-master-ip-address/sth just following the quick start https://hudi.apache.org/docs/quick-start-guide but i got a lot of error like below Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2fbdf32e, see the next exception for details. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 143 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/hduser_/metastore_db. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.access.RAMAccessManager$5.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.store.access.RAMAccessManager.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.db.BasicDatabase$5.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.db.BasicDatabase.bootServiceModule(Unknown Source) at 
org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source) at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source)
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
xuzifu666 commented on code in PR #9875: URL: https://github.com/apache/hudi/pull/9875#discussion_r1362328775 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -157,52 +156,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. 
+// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { + return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); +} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { + return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); Review Comment: > ok,in a condition that day = 2023-10-13 partition are 2000 records(1kb per record),driver memory is 4gb ,sub parition 'hour' from 1 to 24,than query select count(1) from table where day='2023-10-13' or select * from table where day='2023-10-13',driv
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
xuzifu666 commented on code in PR #9875: URL: https://github.com/apache/hudi/pull/9875#discussion_r1362328775 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -157,52 +156,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. 
+// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { + return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); +} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { + return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); Review Comment: > ok,in a condition that day = 2023-10-13 partition are 2000 records(1kb per record),driver memory is 4gb ,sub parition 'hour' from 1 to 24,than query select count(1) from table where day='2023-10-13' or select * from table where day='2023-10-13',driv
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766659212 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366) * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi Job fails fast in concurrent write even with high retries and long wait time [hudi]
ad1happy2go commented on issue #9728: URL: https://github.com/apache/hudi/issues/9728#issuecomment-1766658535 @SamarthRaval @Jason-liujc As discussed, the retry configuration is unrelated to the problem you are facing. The only way to handle such scenarios at the moment is to handle retries in your application-level code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
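A minimal sketch of what "handling retries at the application level" could look like, assuming the Hudi write is wrapped in a function and a conflict surfaces as an exception; in practice one would narrow the catch to the specific write-conflict exception type rather than `Exception`, and tune the backoff to the workload.

```java
public class RetryingWriteSketch {

  @FunctionalInterface
  interface HudiWrite {
    void run() throws Exception;
  }

  // Naive application-level retry around a Hudi commit: if the write fails
  // (e.g. due to a concurrent-write conflict under OCC), back off and retry
  // the whole batch write until the attempt budget is exhausted.
  public static void writeWithRetries(HudiWrite write, int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        write.run();
        return;
      } catch (Exception e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(1000L * attempt); // simple linear backoff
      }
    }
  }
}
```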
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
xuzifu666 commented on code in PR #9875: URL: https://github.com/apache/hudi/pull/9875#discussion_r1362323212 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -157,52 +156,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. 
+// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { + return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); +} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { + return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); Review Comment: ok,in a condition that day = 2023-10-13 partition are 2000 records(1kb per record),driver memory is 4gb ,sub parition 'hour' from 1 to 24,than query select count(1) from table where day='2023-10-13' or select * from table where day='2023-10-13',driver
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
xuzifu666 commented on code in PR #9875: URL: https://github.com/apache/hudi/pull/9875#discussion_r1362323212 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -157,52 +156,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. 
+// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { + return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); +} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { + return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); Review Comment: ok,in a condition that /2023-10-13 partition are 2000 records(1kb per record),driver memory is 4gb ,sub parition hour from 1 to 24,than query select count(1) from table where day='2023-10-13' or select * from table where day='2023-10-13',driver would
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
wecharyu commented on code in PR #9875: URL: https://github.com/apache/hudi/pull/9875#discussion_r1362311711 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -157,52 +156,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. - List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path)) -.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) -.map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. -Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. 
+// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { + return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); +} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { + return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); Review Comment: @xuzifu666 "Processing listed partitions" will left the intermediate path to call listStatus in the next iterator, which is the same as community version now. I have test the query `select count(1) from hudi_test where day='2023-10-17'`, which only lis
[jira] [Created] (HUDI-6951) Use spark3 profile to build hudi-aws-bundle jars for release artifacts
Akira Ajisaka created HUDI-6951: --- Summary: Use spark3 profile to build hudi-aws-bundle jars for release artifacts Key: HUDI-6951 URL: https://issues.apache.org/jira/browse/HUDI-6951 Project: Apache Hudi Issue Type: Improvement Reporter: Akira Ajisaka When hudi-aws-bundle.jar and hudi-spark3.3-bundle_2.12.jar are used at the same time, and hudi-aws-bundle.jar is loaded first in the Spark runtime, it can fails by NoSuchMethodError: {noformat} py4j.protocol.Py4JJavaError: An error occurred while calling ***. : java.lang.NoSuchMethodError: org.apache.hudi.avro.model.HoodieCleanMetadata.getTotalFilesDeleted()I at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557 {noformat} The problem is, currently hudi-aws-bundle jar in Maven central repo is built against spark2 profile and Avro 1.8.2 is used to generate source code from Avro schema file. Then, the generated source code is like {noformat} public Integer getTotalFilesDeleted() { return this.totalFilesDeleted; } {noformat} on the other hand, hudi-spark3.3-bundle_2.12.jar is built with Avro 1.11.1, and the generated source code is like {noformat} public int getTotalFilesDeleted() { return this.totalFilesDeleted; } {noformat} Since Avro 1.9.0, it uses primitive type for generated getters/setters (AVRO-2069). Therefore, if hudi-aws-bundle is loaded first in the runtime, it can fail with the above NoSuchMethodError. Although it can be fixed by changing the classpath loading order or building hudi-aws-bundle by your own, is it possible to provide hudi-aws-spark3.3-bundle.jar in Maven central? or, is it possible to build hudi-aws-bundle jar using spark3 profile by default given most of AWS customer now use Spark 3.x for their runtime? -- This message was sent by Atlassian Jira (v8.20.10#820010)
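To spell out why the mismatch described above only shows up at runtime: the JVM links calls by the full method descriptor, so `getTotalFilesDeleted()I` (primitive int, Avro >= 1.9 code generation) and `getTotalFilesDeleted()Ljava/lang/Integer;` (boxed Integer, Avro 1.8 code generation) are two different methods even though the Java source call site looks identical. A small illustration with the generated classes stubbed out; the class and field names here are stand-ins, not the real generated code.

```java
public class GetterDescriptorSketch {

  // Stand-in for the Avro >= 1.9 generated class: primitive getter, descriptor ()I.
  static class GeneratedWithAvro19 {
    private int totalFilesDeleted;

    public int getTotalFilesDeleted() {
      return totalFilesDeleted;
    }
  }

  // Stand-in for the Avro 1.8.2 generated class: boxed getter, descriptor ()Ljava/lang/Integer;.
  static class GeneratedWithAvro18 {
    private Integer totalFilesDeleted;

    public Integer getTotalFilesDeleted() {
      return totalFilesDeleted;
    }
  }

  public static void main(String[] args) {
    // Callers compiled against one stand-in cannot be linked against the other:
    // the descriptor is part of the method's identity, which is what produces
    // NoSuchMethodError when a bundle built with a different Avro version
    // shadows the generated class on the classpath.
    System.out.println(new GeneratedWithAvro19().getTotalFilesDeleted()); // 0
    System.out.println(new GeneratedWithAvro18().getTotalFilesDeleted()); // null
  }
}
```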
Re: [I] [SUPPORT]Loss record when complete compaction [hudi]
ad1happy2go commented on issue #9869: URL: https://github.com/apache/hudi/issues/9869#issuecomment-1766432963 @15663671003 Can you please explain in more detail? Can you try setting spark.sql.filesourceTableRelationCacheSize to 0 to rule out any possibility of a cached relation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
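Note that `spark.sql.filesourceTableRelationCacheSize` is, at least in recent Spark versions, a static SQL conf, so it has to be supplied when the session is created (for example via `--conf` on spark-submit or the session builder) rather than changed on a running session. A hedged sketch, with the database and table names as placeholders:

```java
import org.apache.spark.sql.SparkSession;

public class DisableRelationCacheSketch {
  public static void main(String[] args) {
    // Setting the cache size to 0 disables caching of file-source table relations,
    // which rules out a stale cached relation as the cause of missing records.
    SparkSession spark = SparkSession.builder()
        .appName("relation-cache-check")
        .config("spark.sql.filesourceTableRelationCacheSize", "0")
        .getOrCreate();

    spark.sql("select count(1) from my_db.my_table").show(); // placeholder query
  }
}
```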
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766431378 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365) * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366) * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766357395 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364) * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365) * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366) * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
hudi-bot commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1766356922 ## CI report: * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy reassigned HUDI-6950: Assignee: xy > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.14.0 >Reporter: xy >Assignee: xy >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg, > oom_stages.jpg > > > currently if multiple partition table,would cause oom easy > eg: > CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} ( > {{id}} string, > {{name}} string, > {{dt}} bigint, > {{day}} STRING COMMENT '日期分区', > {{hour}} INT COMMENT '小时分区' > )using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY ({{{}day{}}},{{{}hour{}}}); > select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where > day='2023-10-17' would list much filestatus to driver,and driver would > oom(such as table with hundreds billion records in a > partition(day='2023-10-17')) -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
xuzifu666 commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766255314 > @wecharyu It is great if you have the review, @xuzifu666 can you supplement with more details, especially the spark stages difference. Sure, I have added the stage details in issue https://issues.apache.org/jira/browse/HUDI-6950 @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6950: - Attachment: oom_stages.jpg > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.14.0 >Reporter: xy >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg, > oom_stages.jpg > > > currently if multiple partition table,would cause oom easy > eg: > CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} ( > {{id}} string, > {{name}} string, > {{dt}} bigint, > {{day}} STRING COMMENT '日期分区', > {{hour}} INT COMMENT '小时分区' > )using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY ({{{}day{}}},{{{}hour{}}}); > select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where > day='2023-10-17' would list much filestatus to driver,and driver would > oom(such as table with hundreds billion records in a > partition(day='2023-10-17')) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6950: - Attachment: fix_stages.jpg > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Affects Versions: 0.14.0 > Reporter: xy > Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg > > > Currently, a table with multiple partition columns can easily cause a driver OOM. > e.g.: > CREATE TABLE hudi_test.tmp_hudi_test_1 ( > id string, > name string, > dt bigint, > day STRING COMMENT 'date partition', > hour INT COMMENT 'hour partition' > ) using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY (day, hour); > select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17' would list a huge number of FileStatus objects to the driver, and the driver would OOM (e.g. a table with hundreds of billions of records in the partition day='2023-10-17') -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766236018 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364) * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365) * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366) * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766222435 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364) * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365) * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366) * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]
danny0405 commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766219389 @wecharyu It would be great if you could review. @xuzifu666 can you supplement with more details, especially the difference in the Spark stages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766157215 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364) * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365) * 1d8c05e27a6d83320e2eedae074e1aba01146923 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776138#comment-17776138 ] xy commented on HUDI-6950: -- fix in https://github.com/apache/hudi/pull/9875/files > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Affects Versions: 0.14.0 > Reporter: xy > Priority: Critical > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg > > > Currently, a table with multiple partition columns can easily cause a driver OOM. > e.g.: > CREATE TABLE hudi_test.tmp_hudi_test_1 ( > id string, > name string, > dt bigint, > day STRING COMMENT 'date partition', > hour INT COMMENT 'hour partition' > ) using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY (day, hour); > select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17' would list a huge number of FileStatus objects to the driver, and the driver would OOM (e.g. a table with hundreds of billions of records in the partition day='2023-10-17') -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6950: - Labels: pull-request-available (was: ) > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Affects Versions: 0.14.0 > Reporter: xy > Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: before_fix_dump_filestatus.jpg > > > Currently, a table with multiple partition columns can easily cause a driver OOM. > e.g.: > CREATE TABLE hudi_test.tmp_hudi_test_1 ( > id string, > name string, > dt bigint, > day STRING COMMENT 'date partition', > hour INT COMMENT 'hour partition' > ) using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY (day, hour); > select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17' would list a huge number of FileStatus objects to the driver, and the driver would OOM (e.g. a table with hundreds of billions of records in the partition day='2023-10-17') -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table [hudi]
hudi-bot commented on PR #9875: URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766142135 ## CI report: * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364) * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table
[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6950: - Description: Currently, a table with multiple partition columns can easily cause a driver OOM. e.g.: CREATE TABLE hudi_test.tmp_hudi_test_1 ( id string, name string, dt bigint, day STRING COMMENT 'date partition', hour INT COMMENT 'hour partition' ) using hudi OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', 'hoodie.datasource.meta.sync.enable' 'false', 'hoodie.datasource.hive_sync.enable' 'false') tblproperties ( 'primaryKey' = 'id', 'type' = 'mor', 'preCombineField'='dt', 'hoodie.index.type' = 'BUCKET', 'hoodie.bucket.index.hash.field' = 'id', 'hoodie.bucket.index.num.buckets'=512 ) PARTITIONED BY (day, hour); select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17' would list a huge number of FileStatus objects to the driver, and the driver would OOM (e.g. a table with hundreds of billions of records in the partition day='2023-10-17') > query should process listed partitions avoid driver oom due to large number > files in table > -- > > Key: HUDI-6950 > URL: https://issues.apache.org/jira/browse/HUDI-6950 > Project: Apache Hudi > Issue Type: Bug > Reporter: xy > Priority: Critical > Attachments: before_fix_dump_filestatus.jpg > > > Currently, a table with multiple partition columns can easily cause a driver OOM. > e.g.: > CREATE TABLE hudi_test.tmp_hudi_test_1 ( > id string, > name string, > dt bigint, > day STRING COMMENT 'date partition', > hour INT COMMENT 'hour partition' > ) using hudi > OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', > 'hoodie.datasource.meta.sync.enable' 'false', > 'hoodie.datasource.hive_sync.enable' 'false') > tblproperties ( > 'primaryKey' = 'id', > 'type' = 'mor', > 'preCombineField'='dt', > 'hoodie.index.type' = 'BUCKET', > 'hoodie.bucket.index.hash.field' = 'id', > 'hoodie.bucket.index.num.buckets'=512 > ) > PARTITIONED BY (day, hour); > select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17' would list a huge number of FileStatus objects to the driver, and the driver would OOM (e.g. a table with hundreds of billions of records in the partition day='2023-10-17') -- This message was sent by Atlassian Jira (v8.20.10#820010)