[GitHub] [hudi] hudi-bot commented on pull request #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator
hudi-bot commented on PR #6739:
URL: https://github.com/apache/hudi/pull/6739#issuecomment-1254563996

## CI report:

* 6756f0e59418c7de7a7ca0d47a3fd2ff0427f04a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11570)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator
hudi-bot commented on PR #6739:
URL: https://github.com/apache/hudi/pull/6739#issuecomment-1254560139

## CI report:

* 6756f0e59418c7de7a7ca0d47a3fd2ff0427f04a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider
hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254556886

## CI report:

* c0c9616166bf46216cdaf9ff8d634770e325e472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11567)
[GitHub] [hudi] alexeykudinkin opened a new pull request, #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator
alexeykudinkin opened a new pull request, #6739:
URL: https://github.com/apache/hudi/pull/6739

### Change Logs

This is taking up the fix from https://github.com/apache/hudi/pull/6700, and adding a test for it.

### Impact

**Risk level: None**

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task
hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254523155

## CI report:

* fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11566)
[GitHub] [hudi] hudi-bot commented on pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data
hudi-bot commented on PR #6734:
URL: https://github.com/apache/hudi/pull/6734#issuecomment-1254520367

## CI report:

* 3d9071b62050a2b72d2522098f2b3263ddf91e40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11554)
* 06c2dca18820ac062262e38deed409ed7d7b4d2b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11569)
[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code
hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254516648

## CI report:

* 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
* 63aaa03dbd85111385ce1cbf09ab5bc173a44c0f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11568)
[GitHub] [hudi] hudi-bot commented on pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data
hudi-bot commented on PR #6734:
URL: https://github.com/apache/hudi/pull/6734#issuecomment-1254516618

## CI report:

* 3d9071b62050a2b72d2522098f2b3263ddf91e40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11554)
* 06c2dca18820ac062262e38deed409ed7d7b4d2b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code
hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-125451

## CI report:

* 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
* 63aaa03dbd85111385ce1cbf09ab5bc173a44c0f UNKNOWN
[GitHub] [hudi] eshu commented on issue #6283: [SUPPORT] No .marker files
eshu commented on issue #6283:
URL: https://github.com/apache/hudi/issues/6283#issuecomment-1254494003

@nsivabalan The workaround is working, but the bug still exists. If the workaround counts as a resolution, then yes, it is resolved.
[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider
hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254476299

## CI report:

* c0c9616166bf46216cdaf9ff8d634770e325e472 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11567)
[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task
hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254476263

## CI report:

* c7c9984860b14b40d3f716f1fc1f16dc70f548b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11551)
* fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11566)
[GitHub] [hudi] IsisPolei commented on issue #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (C
IsisPolei commented on issue #6720:
URL: https://github.com/apache/hudi/issues/6720#issuecomment-1254475342

The original problem is offline compaction. HoodieJavaWriteClient doesn't support inline compaction:

```java
@Override
protected List compact(String compactionInstantTime, boolean shouldComplete) {
  throw new HoodieNotSupportedException("Compact is not supported in HoodieJavaClient");
}
```

So I changed my Hudi client to SparkRDDWriteClient. This client works a treat when using Spark local mode and standalone mode (in the same host machine).
[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider
hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254473579

## CI report:

* c0c9616166bf46216cdaf9ff8d634770e325e472 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task
hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254473535

## CI report:

* c7c9984860b14b40d3f716f1fc1f16dc70f548b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11551)
* fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code
hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254470224

## CI report:

* 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…
hudi-bot commented on PR #6736:
URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254470203

## CI report:

* 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11564)
[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full
hudi-bot commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254469770

## CI report:

* 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN
* 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11563)
[hudi] branch release-feature-rfc46 updated: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (#6132)
This is an automated email from the ASF dual-hosted git repository.

yuzhaojing pushed a commit to branch release-feature-rfc46
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/release-feature-rfc46 by this push:
     new 41392e119f [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (#6132)

41392e119f is described below

commit 41392e119fcc7c7433d415d70b3800bc3dbf0e2b
Author: komao
AuthorDate: Thu Sep 22 11:17:54 2022 +0800

    [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (#6132)

    * Update the RFC-46 doc to fix comments feedback

    * fix

    Co-authored-by: wangzixuan.wzxuan
---
 rfc/rfc-46/rfc-46.md | 169 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 134 insertions(+), 35 deletions(-)

diff --git a/rfc/rfc-46/rfc-46.md b/rfc/rfc-46/rfc-46.md
index a851a4443a..192bdbf8c6 100644
--- a/rfc/rfc-46/rfc-46.md
+++ b/rfc/rfc-46/rfc-46.md
@@ -38,7 +38,7 @@ when dealing with records (during merge, column value extractions, writing into
 While having a single format of the record representation is certainly making implementation of some components simpler,
 it bears unavoidable performance penalty of de-/serialization loop: every record handled by Hudi has to be converted
-from (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into intermediate
+from (low-level) engine-specific representation (`InternalRow` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into intermediate
 one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on read- and
 write-paths).
@@ -84,59 +84,105 @@ is known to have poor performance (compared to non-reflection based instantiation)

 Record Merge API

-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge to one API.
+Stateless component interface providing for API Combining Records will look like following:

 ```java
-interface HoodieMerge {
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-  Option combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
-}
+interface HoodieRecordMerger {

   /**
-   * Spark-specific implementation
+   * The kind of merging strategy this recordMerger belongs to. A UUID represents merging strategy.
    */
-  class HoodieSparkRecordMerge implements HoodieMerge {
+  String getMergingStrategy();
+
+  // This method converges combineAndGetUpdateValue and precombine from HoodiePayload.
+  // It'd be associative operation: f(a, f(b, c)) = f(f(a, b), c) (which we can translate as having 3 versions A, B, C
+  // of the single record, both orders of operations applications have to yield the same result)
+  Option merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+
+  // The record type handled by the current merger
+  // SPARK, AVRO, FLINK
+  HoodieRecordType getRecordType();
+}

-    @Override
-    public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-      // HoodieSparkRecords preCombine
-    }
+/**
+ * Spark-specific implementation
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
+
+  @Override
+  public String getMergingStrategy() {
+    return UUID_MERGER_STRATEGY;
+  }
+
+  @Override
+  Option merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieSparkRecord precombine and combineAndGetUpdateValue. It'd be associative operation.
+  }

-    @Override
-    public Option combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-      // HoodieSparkRecord combineAndGetUpdateValue
-    }
+  @Override
+  HoodieRecordType getRecordType() {
+    return HoodieRecordType.SPARK;
   }
+}

-  /**
-   * Flink-specific implementation
-   */
-  class HoodieFlinkRecordMerge implements HoodieMerge {
-
-    @Override
-    public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-      // HoodieFlinkRecord preCombine
-    }
+/**
+ * Flink-specific implementation
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+  @Override
+  public String getMergingStrategy() {
+    return UUID_MERGER_STRATEGY;
+  }
+
+  @Override
+  Option merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieFlinkRecord precombine and combineAndGetUpdateValue. It'd be associative operation.
+  }

-    @Override
-    public Option combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-      // HoodieFlinkRecord
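The commit above pins down the key contract of the new merge API: it must be associative, f(a, f(b, c)) = f(f(a, b), c), so that different groupings of record versions yield the same result. As a toy illustration (the `Rec` class and `merge` method below are hypothetical stand-ins, not Hudi's `HoodieRecordMerger` API), a latest-ordering-value-wins merge satisfies this property:

```java
// Toy illustration of the associativity requirement from the merge API above:
// latest-ordering-value-wins, so f(a, f(b, c)) == f(f(a, b), c).
// All names here are hypothetical, not Hudi's API.
import java.util.Objects;

public class AssociativeMergeSketch {

    static final class Rec {
        final String key;
        final long ts;       // ordering field, playing the role of a precombine field
        final String value;

        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Rec)) {
                return false;
            }
            Rec r = (Rec) o;
            return key.equals(r.key) && ts == r.ts && value.equals(r.value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, ts, value);
        }
    }

    // Keep whichever version carries the larger ordering value; ties go to `newer`.
    static Rec merge(Rec older, Rec newer) {
        return newer.ts >= older.ts ? newer : older;
    }

    public static void main(String[] args) {
        Rec a = new Rec("k", 1, "v1");
        Rec b = new Rec("k", 2, "v2");
        Rec c = new Rec("k", 3, "v3");
        Rec left = merge(merge(a, b), c);   // f(f(a, b), c)
        Rec right = merge(a, merge(b, c));  // f(a, f(b, c))
        System.out.println(left.equals(right) && "v3".equals(left.value)); // prints "true"
    }
}
```

A non-associative merge (e.g., one that concatenates values) would break this contract: compaction and clustering could then produce different records depending on the order in which they combined the same versions.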
[GitHub] [hudi] yuzhaojing merged pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
yuzhaojing merged PR #6132:
URL: https://github.com/apache/hudi/pull/6132
[GitHub] [hudi] yuzhaojing merged pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
yuzhaojing merged PR #5629:
URL: https://github.com/apache/hudi/pull/5629
[GitHub] [hudi] IsisPolei commented on issue #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (C
IsisPolei commented on issue #6720:
URL: https://github.com/apache/hudi/issues/6720#issuecomment-125445

I think the main reason for this problem is that my app (where SparkRDDWriteClient processes Hudi data) and the Spark cluster that SparkRDDWriteClient connects to are deployed on different local machines. When both Docker containers run on the same host machine, everything works well, since the containers can reach each other over the Docker bridge network (the Hudi Docker demo is also one such scenario). So I'm trying to find out how exactly Hudi and Spark connect to each other during this process. At first I thought that if HoodieSparkEngineContext initialized successfully, the connection part was done. Apparently there is something more. For example, the timeline server and the remote file system view should also be reachable, because the application will be running on the Spark worker nodes.
[jira] [Updated] (HUDI-4895) Object Store based lock provider
[ https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuwei Xiao updated HUDI-4895:
-----------------------------
Component/s: multi-writer

> Object Store based lock provider
> --------------------------------
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
> Issue Type: Improvement
> Components: multi-writer
> Reporter: Yuwei Xiao
> Assignee: Yuwei Xiao
> Priority: Major
> Labels: pull-request-available
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic
> guarantees of the underlying file system. Specifically, only with the filesystem's
> atomic rename & atomic create capabilities can the LockProvider work properly.
>
> However, many Hudi users use an object store (e.g., S3, OSS), so we want to
> implement an object store based lock provider.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query
[ https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuwei Xiao updated HUDI-4812:
-----------------------------
Component/s: spark

> Lazy partition listing and file groups fetching in Spark Query
> --------------------------------------------------------------
>
> Key: HUDI-4812
> URL: https://issues.apache.org/jira/browse/HUDI-4812
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Yuwei Xiao
> Assignee: Yuwei Xiao
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.1
>
> In the current Spark query implementation, the FileIndex will refresh and load
> all file groups into cache in order to serve subsequent queries.
>
> For a large table with many partitions, this may introduce much overhead during
> initialization. Meanwhile, the query itself may come with a partition filter,
> so loading all file groups may be unnecessary.
>
> To optimize this, the whole refresh logic will become lazy: the actual work
> will be carried out only after the partition filter is applied.
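The lazy refresh described in HUDI-4812 boils down to: list a partition's file groups only when a query's partition filter actually selects that partition, and cache the result for subsequent queries. A minimal sketch of that idea (toy class and method names, not Hudi's real FileIndex API):

```java
// Toy model of lazy, filter-driven partition listing with caching.
// Names are illustrative only; this is not Hudi's FileIndex implementation.
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

public class LazyFileIndexSketch {

    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> lister; // stands in for the expensive storage listing
    int listCalls = 0; // exposed only so the example can show laziness

    public LazyFileIndexSketch(Function<String, List<String>> lister) {
        this.lister = lister;
    }

    // Partitions pruned by the filter are never listed; repeat queries hit the cache.
    public List<String> filesFor(Collection<String> partitions, Predicate<String> partitionFilter) {
        List<String> out = new ArrayList<>();
        for (String partition : partitions) {
            if (!partitionFilter.test(partition)) {
                continue; // pruned: no listing cost paid for this partition
            }
            out.addAll(cache.computeIfAbsent(partition, p -> {
                listCalls++;
                return lister.apply(p);
            }));
        }
        return out;
    }
}
```

With this shape, a query filtered to one partition of a thousand pays for exactly one listing call, and a second identical query pays for none.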
[jira] [Created] (HUDI-4896) Consistent hashing index resizing for Flink Engine
Yuwei Xiao created HUDI-4896:
Summary: Consistent hashing index resizing for Flink Engine
Key: HUDI-4896
URL: https://issues.apache.org/jira/browse/HUDI-4896
Project: Apache Hudi
Issue Type: Improvement
Components: clustering, index
Reporter: Yuwei Xiao
[jira] [Updated] (HUDI-4895) Object Store based lock provider
[ https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4895:
---------------------------------
Labels: pull-request-available (was: )

> Object Store based lock provider
> --------------------------------
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Yuwei Xiao
> Assignee: Yuwei Xiao
> Priority: Major
> Labels: pull-request-available
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic
> guarantees of the underlying file system. Specifically, only with the filesystem's
> atomic rename & atomic create capabilities can the LockProvider work properly.
>
> However, many Hudi users use an object store (e.g., S3, OSS), so we want to
> implement an object store based lock provider.
[GitHub] [hudi] YuweiXiao opened a new pull request, #6738: [HUDI-4895] Object store based lock provider
YuweiXiao opened a new pull request, #6738:
URL: https://github.com/apache/hudi/pull/6738

### Change Logs

Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic guarantees of the underlying file system. Specifically, only with the filesystem's atomic rename & atomic create capabilities can the LockProvider work properly. This PR enables an object store (e.g., AliyunOSS) to serve as a lock provider.

### Impact

No API change.

**Risk level: none | low | medium | high**

LOW.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
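The core idea behind a lock provider like this — acquire the lock by atomically creating a marker object, so that exactly one contender can succeed — can be sketched with the local filesystem's create-if-absent standing in for an object store's conditional put. Class and method names below are illustrative assumptions, not the PR's implementation:

```java
// Sketch of "lock = atomically create a marker object", using the local
// filesystem's atomic create-if-absent as a stand-in for an object store's
// conditional put. Illustrative only; not the PR's actual code.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CreateIfAbsentLockSketch {

    private final Path lockFile;

    public CreateIfAbsentLockSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Exactly one contender can create the marker; everyone else loses the race.
    public boolean tryLock() {
        try {
            Files.createFile(lockFile); // fails atomically if the marker already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // someone else holds the lock: retry or back off
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void unlock() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real object-store provider has extra concerns this sketch ignores — lease expiry for crashed holders, and whether the store actually offers an atomic create-if-absent primitive — which is exactly why a dedicated provider is needed rather than reusing the filesystem-based one.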
[jira] [Assigned] (HUDI-4895) Object Store based lock provider
[ https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuwei Xiao reassigned HUDI-4895:
--------------------------------
Assignee: Yuwei Xiao

> Object Store based lock provider
> --------------------------------
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Yuwei Xiao
> Assignee: Yuwei Xiao
> Priority: Major
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic
> guarantees of the underlying file system. Specifically, only with the filesystem's
> atomic rename & atomic create capabilities can the LockProvider work properly.
>
> However, many Hudi users use an object store (e.g., S3, OSS), so we want to
> implement an object store based lock provider.
[GitHub] [hudi] loukey-lj commented on pull request #6704: [HUDI-4780] improve test setup
loukey-lj commented on PR #6704:
URL: https://github.com/apache/hudi/pull/6704#issuecomment-1254445377

> @xushiyan thank you, this PR supplements the [6602](https://github.com/apache/hudi/pull/6602) test case. You can first look at the review record of [6602](https://github.com/apache/hudi/pull/6602).
[GitHub] [hudi] xicm commented on a diff in pull request #6715: [HUDI-3983] ClassNotFoundException when using hudi-spark-bundle to write table with hbase index
xicm commented on code in PR #6715:
URL: https://github.com/apache/hudi/pull/6715#discussion_r977131538

## hudi-common/src/main/resources/hbase-site.xml:

@@ -1699,13 +1699,6 @@ possible configurations would overwhelm and obscure the important.
     Implementation of the status publication with a multicast message.
-
-    hbase.status.listener.class
-    org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener

Review Comment:
I mean we use the shaded name only with the bundle jar. If the dependency we use is hudi-common and the listener class comes from the original hbase-client, we will get an exception in that case.
[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code
hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254433425

## CI report:

* 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 UNKNOWN
[jira] [Updated] (HUDI-4373) Consistent bucket index write path for Flink engine
[ https://issues.apache.org/jira/browse/HUDI-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4373:
---------------------------------
Labels: pull-request-available (was: )

> Consistent bucket index write path for Flink engine
> ---------------------------------------------------
>
> Key: HUDI-4373
> URL: https://issues.apache.org/jira/browse/HUDI-4373
> Project: Apache Hudi
> Issue Type: New Feature
> Components: flink, index
> Reporter: Yuwei Xiao
> Assignee: Yuwei Xiao
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Simple bucket index (with a fixed bucket number) is ready for the Flink engine and
> has been used widely in the community.
>
> Since Spark now supports consistent buckets (dynamic bucket number), we should
> bridge the gap and bring this feature to Flink too.
[GitHub] [hudi] YuweiXiao opened a new pull request, #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code
YuweiXiao opened a new pull request, #6737:
URL: https://github.com/apache/hudi/pull/6737

### Change Logs

Implement the consistent hashing bucket index for Flink. This PR only covers the write core of the index; the resizing implementation will be in another PR. There are three main changes:

- Extract common code of the consistent hashing bucket index, shared with the Spark engine.
- Have the Flink engine write path adapt to the consistent hashing bucket index, e.g., introduce `ConsistentBucketStreamWriteOperator`.
- Introduce the basic framework of `UpdateStrategy` for Flink, to handle conflicts between concurrent clustering & updates.

### Impact

No public API change.

**Risk level: none | low | medium | high**

Low

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
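The lookup side of a consistent hashing bucket index can be sketched with a sorted ring: each bucket owns a position on the ring, and a record key maps to its clockwise successor, so resizing (adding or splitting buckets) only remaps keys in the affected range. The class, hash choice, and node naming below are toy assumptions, not the PR's actual code:

```java
// Toy consistent-hashing ring: bucket ids placed on a ring, record keys routed
// to their clockwise successor. Illustrative only; not Hudi's implementation.
import java.util.Map;
import java.util.TreeMap;

public class ConsistentBucketSketch {

    private final TreeMap<Integer, Integer> ring = new TreeMap<>(); // ring position -> bucket id

    public ConsistentBucketSketch(int numBuckets) {
        for (int b = 0; b < numBuckets; b++) {
            ring.put(hash("bucket-" + b), b); // each bucket owns one point on the ring
        }
    }

    static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // non-negative ring position
    }

    // Clockwise successor on the ring; wrap around to the first node if none is >= hash.
    public int bucketFor(String recordKey) {
        Map.Entry<Integer, Integer> e = ring.ceilingEntry(hash(recordKey));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

This is why consistent hashing permits a dynamic bucket number where simple (modulo) bucketing does not: with `hash(key) % n`, changing `n` reshuffles nearly every key, whereas splitting one ring range leaves all other buckets' keys in place.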
[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…
hudi-bot commented on PR #6736: URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254427155 ## CI report: * 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11564) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…
hudi-bot commented on PR #6736: URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254423537 ## CI report: * 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle
hudi-bot commented on PR #6735: URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254419744 ## CI report: * 51c0c21c9f5a689943147a1faded74c67fef61a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11562) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xicm commented on a diff in pull request #6715: [HUDI-3983] ClassNotFoundException when using hudi-spark-bundle to write table with hbase index
xicm commented on code in PR #6715: URL: https://github.com/apache/hudi/pull/6715#discussion_r977117915 ## hudi-common/src/main/resources/hbase-site.xml: ## @@ -1699,13 +1699,6 @@ possible configurations would overwhelm and obscure the important. Implementation of the status publication with a multicast message. - -hbase.status.listener.class - org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener Review Comment: @yihua If we rename the class with the shaded name, there will be a ClassNotFoundException when referencing hudi-common. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-4895) Object Store based lock provider
Yuwei Xiao created HUDI-4895: Summary: Object Store based lock provider Key: HUDI-4895 URL: https://issues.apache.org/jira/browse/HUDI-4895 Project: Apache Hudi Issue Type: Improvement Reporter: Yuwei Xiao Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic guarantees of the underlying file system. Specifically, the LockProvider can only work properly when the file system offers atomic rename and atomic create capabilities. However, many Hudi users use object stores (e.g., S3, OSS), so we want to implement an object store based lock provider. -- This message was sent by Atlassian Jira (v8.20.10#820010)
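As a rough illustration of what such a provider could look like, the sketch below acquires a lock by atomically creating a lock object and releases it by deleting it. The `ObjectStore` interface, the in-memory store, and all names here are hypothetical stand-ins, not the PR's design: a real implementation needs an object-store primitive with create-if-absent semantics (or some external consistency mechanism, since not every store offers one natively), plus lease expiry so a crashed writer does not hold the lock forever.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an object store that offers an atomic
// create-if-absent primitive; a real provider would back this with
// an S3/OSS client.
interface ObjectStore {
  boolean createIfAbsent(String key, String owner); // true iff key did not already exist
  void delete(String key);
}

// In-memory store used only to exercise the sketch.
class InMemoryStore implements ObjectStore {
  private final ConcurrentHashMap<String, String> objects = new ConcurrentHashMap<>();
  public boolean createIfAbsent(String key, String owner) {
    return objects.putIfAbsent(key, owner) == null;
  }
  public void delete(String key) {
    objects.remove(key);
  }
}

// Sketch of the acquire/release cycle of an object-store lock provider.
class ObjectStoreLockSketch {
  private final ObjectStore store;
  private final String lockKey;
  private final String owner;

  ObjectStoreLockSketch(ObjectStore store, String lockKey, String owner) {
    this.store = store;
    this.lockKey = lockKey;
    this.owner = owner;
  }

  // Retry the atomic create until it succeeds or attempts run out;
  // a real provider would back off between attempts.
  boolean tryLock(int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
      if (store.createIfAbsent(lockKey, owner)) {
        return true;
      }
    }
    return false;
  }

  // Releasing the lock deletes the lock object.
  void unlock() {
    store.delete(lockKey);
  }
}
```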
[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column
[ https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4894: - Labels: pull-request-available (was: ) > Fix ClassCastException when using fixed type defining decimal column > > > Key: HUDI-4894 > URL: https://issues.apache.org/jira/browse/HUDI-4894 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > schema for decimal column : > {code:java} > { > "name": "column_name", > "type": ["null", { > "type": "fixed", > "name": "fixed", > "size": 5, > "logicalType": "decimal", > "precision": 10, > "scale": 2 > }], > "default": null > }{code} > > exception: > Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) > at > org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) > at > org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) > at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] wangxianghu opened a new pull request, #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…
wangxianghu opened a new pull request, #6736: URL: https://github.com/apache/hudi/pull/6736 …imal column ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column
[ https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang updated HUDI-4894: --- Description: schema for decimal column : {code:java} { "name": "column_name", "type": ["null", { "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null }{code} exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs was: schema for decimal column : { "name": "column_name", "type": ["null",{ "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs > Fix ClassCastException when using fixed type defining decimal column > > > Key: HUDI-4894 > URL: 
https://issues.apache.org/jira/browse/HUDI-4894 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Fix For: 0.12.1 > > > schema for decimal column : > {code:java} > { > "name": "column_name", > "type": ["null", { > "type": "fixed", > "name": "fixed", > "size": 5, > "logicalType": "decimal", > "precision": 10, > "scale": 2 > }], > "default": null > }{code} > > exception: > Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) > at > org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) > at > org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) > at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column
[ https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang updated HUDI-4894: --- Description: schema for decimal column : { "name": "column_name", "type": ["null",{ "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs was: schema for decimal column : { "name": "column_name", "type": ["null", { "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs > Fix ClassCastException when using fixed type defining decimal column > > > Key: HUDI-4894 > URL: 
https://issues.apache.org/jira/browse/HUDI-4894 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Fix For: 0.12.1 > > > schema for decimal column : > { > "name": "column_name", > "type": ["null",{ > "type": "fixed", > "name": "fixed", > "size": 5, > "logicalType": "decimal", > "precision": 10, > "scale": 2 }], > "default": null > } > > exception: > Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) > at > org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) > at > org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) > at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column
[ https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang updated HUDI-4894: --- Description: schema for decimal column : { "name": "column_name", "type": ["null", { "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs was: schema for decimal column : { "name": "decimal_column_name", "type": ["null", { "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs > Fix ClassCastException when using fixed type defining decimal column > > > Key: HUDI-4894 > URL: 
https://issues.apache.org/jira/browse/HUDI-4894 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Fix For: 0.12.1 > > > schema for decimal column : > { > "name": "column_name", > "type": ["null", { > "type": "fixed", > "name": "fixed", > "size": 5, > "logicalType": "decimal", > "precision": 10, > "scale": 2 > }], > "default": null > } > > exception: > Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) > at > org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) > at > org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) > at > org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) > at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column
Xianghu Wang created HUDI-4894: -- Summary: Fix ClassCastException when using fixed type defining decimal column Key: HUDI-4894 URL: https://issues.apache.org/jira/browse/HUDI-4894 Project: Apache Hudi Issue Type: Bug Components: core Reporter: Xianghu Wang Assignee: Xianghu Wang Fix For: 0.12.1 schema for decimal column : { "name": "decimal_column_name", "type": ["null", { "type": "fixed", "name": "fixed", "size": 5, "logicalType": "decimal", "precision": 10, "scale": 2 }], "default": null } exception: Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254) at org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140) at org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107) at org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96) at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs -- This message was sent by Atlassian Jira (v8.20.10#820010)
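For context on the conversion the fix needs: an Avro `fixed`-backed decimal stores the unscaled value as big-endian two's-complement bytes of exactly `size` bytes, so a JSON number arriving for the schema above should be scaled and byte-encoded, not cast to a `List`. The sketch below illustrates that encoding with plain JDK classes for the reported schema (size 5, precision 10, scale 2); it is not the actual `MercifulJsonConverter` patch, and the helper name is made up (Avro's `Conversions.DecimalConversion` does this for real).

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;

// Illustrative encoding of a JSON double into the fixed-size
// big-endian two's-complement bytes an Avro fixed decimal expects.
class FixedDecimalSketch {
  static byte[] toFixedBytes(double value, int scale, int size) {
    // Apply the schema's scale, then take the unscaled integer value.
    BigDecimal scaled = BigDecimal.valueOf(value).setScale(scale, RoundingMode.HALF_UP);
    byte[] unscaled = scaled.unscaledValue().toByteArray();
    if (unscaled.length > size) {
      throw new IllegalArgumentException("value does not fit in fixed(" + size + ")");
    }
    // Sign-extend into the fixed-size buffer (big-endian layout).
    byte[] out = new byte[size];
    byte pad = (byte) (scaled.signum() < 0 ? 0xFF : 0x00);
    Arrays.fill(out, 0, size - unscaled.length, pad);
    System.arraycopy(unscaled, 0, out, size - unscaled.length, unscaled.length);
    return out;
  }
}
```

For example, `123.45` at scale 2 has unscaled value `12345` (`0x3039`), which lands in the last two bytes of the five-byte buffer.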
[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full
hudi-bot commented on PR #6284: URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254378397 ## CI report: * 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN * 0ea0766862c16ccec08c7c621f98ca8402f772ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10571) * 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11563) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark
danny0405 commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r977092086 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,253 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.io; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieAvroPayload; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.cdc.HoodieCDCOperation; +import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode; +import org.apache.hudi.common.table.cdc.HoodieCDCUtils; +import org.apache.hudi.common.table.log.AppendResult; +import org.apache.hudi.common.table.log.HoodieLogFormat; +import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.DefaultSizeEstimator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ExternalSpillableMap; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.HoodieUpsertException; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * This class encapsulates all the cdc-writing functions. 
+ */ +public class HoodieCDCLogger implements Closeable { + + private final String commitTime; + + private final String keyField; + + private final Schema dataSchema; + + private final boolean populateMetaFields; + + // writer for cdc data + private final HoodieLogFormat.Writer cdcWriter; + + private final boolean cdcEnabled; + + private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode; + + private final Schema cdcSchema; + + private final String cdcSchemaString; + + // the cdc data + private final Map cdcData; + + public HoodieCDCLogger( + String commitTime, + HoodieWriteConfig config, + HoodieTableConfig tableConfig, + Schema schema, + HoodieLogFormat.Writer cdcWriter, + long maxInMemorySizeInBytes) { +try { + this.commitTime = commitTime; + this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema); + this.populateMetaFields = config.populateMetaFields(); + this.keyField = populateMetaFields ? HoodieRecord.RECORD_KEY_METADATA_FIELD + : tableConfig.getRecordKeyFieldProp(); + this.cdcWriter = cdcWriter; + + this.cdcEnabled = config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED); + this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse( Review Comment: Is the `cdcEnabled` flag always true here ? Because this is a cdc logger. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #5341: [HUDI-3919] [UBER] Support out of order rollback blocks in AbstractHoodieLogRecordReader
nsivabalan commented on PR #5341: URL: https://github.com/apache/hudi/pull/5341#issuecomment-1254376340 Closing in favor of https://github.com/apache/hudi/pull/5958 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] suryaprasanna closed pull request #5341: [HUDI-3919] [UBER] Support out of order rollback blocks in AbstractHoodieLogRecordReader
suryaprasanna closed pull request #5341: [HUDI-3919] [UBER] Support out of order rollback blocks in AbstractHoodieLogRecordReader URL: https://github.com/apache/hudi/pull/5341 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full
hudi-bot commented on PR #6284: URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254375965 ## CI report: * 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN * 0ea0766862c16ccec08c7c621f98ca8402f772ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10571) * 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
hudi-bot commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254374818 ## CI report: * e1cf530fbae41de33cb9cc76a16a2e6dc5425837 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11560) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark
danny0405 commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r977089707 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java: ## @@ -93,13 +94,18 @@ public void write(GenericRecord oldRecord) { throw new HoodieUpsertException("Insert/Update not in sorted order"); } try { +Option insertRecord; if (useWriterSchemaForCompaction) { - writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps())); + insertRecord = hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps()); } else { - writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchema, config.getProps())); + insertRecord = hoodieRecord.getData().getInsertValue(tableSchema, config.getProps()); } +writeRecord(hoodieRecord, insertRecord); insertRecordsWritten++; writtenRecordKeys.add(keyToPreWrite); +if (cdcEnabled) { + cdcLogger.put(hoodieRecord, null, insertRecord); +} Review Comment: If sorted merge deserves a sub-class, we should follow that and give the CDC feature a sub-class too; I would see it as proof that we should keep good extensibility for different use cases and components. > it will quickly become unmanageable Why do you call it unmanageable if we only add 2 classes here? I don't feel that, given that I already added 5 handles for Flink. We can manage them because they are instantiated in a factory in the base commit executor. BTW, imagine what a mess the code would be if I put the Flink logic into the base handles. But I agree we need some refactoring of the base handles, though that should be very small. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4893: -- Status: In Progress (was: Open) > More than 1 splits are created for a single log file for MOR table > -- > > Key: HUDI-4893 > URL: https://issues.apache.org/jira/browse/HUDI-4893 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.12.1 > > > While debugging a flaky test, realized that we are generating more than one > split for a single log file. Root-caused it to isSplitable(), which returns > true for HoodieRealtimePath. > > [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91] > > I made a quick fix locally and verified that only one split is generated per > log file. > > {code:java} > git diff > hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > diff --git > a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > > b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > index bba44d5c66..d09dfdf753 100644 > --- > a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > +++ > b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path { >} > >public boolean isSplitable() { > -return !toString().isEmpty() && !includeBootstrapFilePath(); > +return !toString().contains(".log") && !includeBootstrapFilePath(); >} > >public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4884) Fix website docs for default index type in hudi
[ https://issues.apache.org/jira/browse/HUDI-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4884: -- Reviewers: Ethan Guo > Fix website docs for default index type in hudi > --- > > Key: HUDI-4884 > URL: https://issues.apache.org/jira/browse/HUDI-4884 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > [https://hudi.apache.org/docs/faq#how-does-the-hudi-indexing-work--what-are-its-benefits] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4893: -- Story Points: 2 > More than 1 splits are created for a single log file for MOR table > -- > > Key: HUDI-4893 > URL: https://issues.apache.org/jira/browse/HUDI-4893 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.12.1 > > > While debugging a flaky test, realized that we are generating more than one > split for a single log file. Root-caused it to isSplitable(), which returns > true for HoodieRealtimePath. > > [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91] > > I made a quick fix locally and verified that only one split is generated per > log file. > > {code:java} > git diff > hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > diff --git > a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > > b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > index bba44d5c66..d09dfdf753 100644 > --- > a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > +++ > b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java > @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path { >} > >public boolean isSplitable() { > -return !toString().isEmpty() && !includeBootstrapFilePath(); > +return !toString().contains(".log") && !includeBootstrapFilePath(); >} > >public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4848) Fix tooling for deprecated partition
[ https://issues.apache.org/jira/browse/HUDI-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4848:
    Reviewers: Raymond Xu

> Fix tooling for deprecated partition
> ------------------------------------
>
>                 Key: HUDI-4848
>                 URL: https://issues.apache.org/jira/browse/HUDI-4848
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> hudi cli has support to fix a deprecated partition, but it assumes the "string"
> datatype for the partitioning col. We might have to fix that assumption.

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Assigned] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-4893:
    Assignee: sivabalan narayanan

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4893:
    Sprint: 2022/09/19

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4893:
    Fix Version/s: 0.12.1

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
sivabalan narayanan created HUDI-4893:
    Summary: More than 1 splits are created for a single log file for MOR table
        Key: HUDI-4893
        URL: https://issues.apache.org/jira/browse/HUDI-4893
    Project: Apache Hudi
    Issue Type: Bug
    Components: reader-core
    Reporter: sivabalan narayanan

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table
[ https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4893:
    Priority: Blocker (was: Major)

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full
nsivabalan commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254371638

@xushiyan: can you review this?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark
danny0405 commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r977087389

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

```java
@@ -292,6 +315,9 @@ protected void writeInsertRecord(HoodieRecord hoodieRecord) throws IOException {
       return;
     }
     if (writeRecord(hoodieRecord, insertRecord, HoodieOperation.isDelete(hoodieRecord.getOperation()))) {
+      if (cdcEnabled) {
+        cdcLogger.put(hoodieRecord, null, insertRecord);
+      }
```

Review Comment: What do you mean by `deserialized twice`? Just overriding the `writeRecord` method and adding the cdc logger logic should work here.
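The suggestion above — override `writeRecord` once instead of adding CDC logging at each call site — can be sketched as follows. All names here (`MergeHandle`, `writeRecord`, `cdcLog`, and the `String` record stand-ins) are illustrative assumptions, not the actual Hudi classes; the point is that a single override keeps the CDC logging in one place and records only successful writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the review suggestion: centralize CDC logging by
// overriding the write method once, rather than repeating the
// "if (cdcEnabled) cdcLogger.put(...)" pattern at every call site.
public class CdcOverrideSketch {

    // Stand-in for the base merge handle; a null value models a failed write.
    public static class MergeHandle {
        public boolean writeRecord(String key, String value) {
            return value != null;
        }
    }

    // CDC-aware subclass: one override mirrors every successful write
    // into the CDC log.
    public static class CdcAwareMergeHandle extends MergeHandle {
        public final List<String> cdcLog = new ArrayList<>();

        @Override
        public boolean writeRecord(String key, String value) {
            boolean written = super.writeRecord(key, value);
            if (written) {
                cdcLog.add(key + "->" + value); // CDC logging lives here only
            }
            return written;
        }
    }

    public static void main(String[] args) {
        CdcAwareMergeHandle handle = new CdcAwareMergeHandle();
        handle.writeRecord("k1", "v1");
        handle.writeRecord("k2", null); // failed write: no CDC entry
        System.out.println(handle.cdcLog); // prints [k1->v1]
    }
}
```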
[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle
[ https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-4892:
    Sprint: 2022/09/19

> Fix hudi-spark3-bundle
> ----------------------
>
>                 Key: HUDI-4892
>                 URL: https://issues.apache.org/jira/browse/HUDI-4892
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> Using hudi-spark3-bundle with the Spark 3.3 shell, the following exception is
> thrown; some classes are not packaged into the bundle.
>
> {code:java}
> scala> val df = spark.read.format("hudi").load("")
> java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.hudi.Spark32PlusDefaultSource not found
>   at java.util.ServiceLoader.fail(ServiceLoader.java:239)
>   at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   ... 47 elided
> {code}

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
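For context on the failure mode in the stack trace above: Spark resolves `format("hudi")` by scanning `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` entries with `java.util.ServiceLoader`, and if a services file names a provider class that was not packaged into the bundle jar, the lazy iterator throws `ServiceConfigurationError` ("Provider ... not found") during the scan. The deterministic sketch below uses an ad-hoc interface as a stand-in assumption (since `DataSourceRegister` itself is not on the classpath here) to show the scan mechanism; with no registered providers the scan simply yields nothing.

```java
import java.util.ServiceLoader;

public class ServiceScan {

    // Stand-in for org.apache.spark.sql.sources.DataSourceRegister.
    public interface Register {
        String shortName();
    }

    // Spark does essentially this over DataSourceRegister when resolving a
    // short name like "hudi": every class named in a
    // META-INF/services/... file must be loadable from the classpath, or the
    // ServiceLoader iterator throws ServiceConfigurationError mid-scan --
    // which is exactly what the stack trace above shows for
    // org.apache.hudi.Spark32PlusDefaultSource.
    public static int countProviders() {
        int n = 0;
        for (Register r : ServiceLoader.load(Register.class)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // No META-INF/services entry exists for this ad-hoc interface, so the
        // scan completes cleanly with zero providers and no error.
        System.out.println(countProviders()); // prints 0
    }
}
```

This is why the fix is a packaging change: once the advertised provider class is actually shaded into the bundle, the same scan succeeds.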
[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle
[ https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-4892:
    Status: Patch Available (was: In Progress)

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle
[ https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-4892:
    Status: In Progress (was: Open)

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle
hudi-bot commented on PR #6735:
URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254343526

## CI report:

* 51c0c21c9f5a689943147a1faded74c67fef61a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11562)

Bot commands — @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case
hudi-bot commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254343298

## CI report:

* 3c05d0af21cc79358b7c0ffb7aad579da19129db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11561)
[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle
hudi-bot commented on PR #6735:
URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254341123

## CI report:

* 51c0c21c9f5a689943147a1faded74c67fef61a2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case
hudi-bot commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254340889

## CI report:

* 054e2a560ef080b3591d52f3b2d1cd8b3c2ab0f7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11169)
* 3c05d0af21cc79358b7c0ffb7aad579da19129db UNKNOWN
[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle
[ https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4892:
    Labels: pull-request-available (was: )

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] yihua opened a new pull request, #6735: [HUDI-4892] Fix hudi-spark3-bundle
yihua opened a new pull request, #6735:
URL: https://github.com/apache/hudi/pull/6735

### Change Logs

This PR fixes the hudi-spark3-bundle. Before this PR, reading a Hudi table with the Spark datasource in the Spark 3.3 shell with hudi-spark3-bundle throws the following exception; some classes are not packaged into the spark3 bundle.

```
scala> val df = spark.read.format("hudi").load("")
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.hudi.Spark32PlusDefaultSource not found
  at java.util.ServiceLoader.fail(ServiceLoader.java:239)
  at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
  ... 47 elided
```

### Impact

**Risk level: low**

Fixing the hudi-spark3-bundle packaging only, to avoid class-not-found errors. Tested locally and on EMR that the hudi-spark3-bundle works after the fix.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data
alexeykudinkin commented on code in PR #6734:
URL: https://github.com/apache/hudi/pull/6734#discussion_r977061910

## hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:

```java
@@ -109,6 +109,11 @@ public static Schema createNullableSchema(Schema.Type avroType) {
     return Schema.createUnion(Schema.create(Schema.Type.NULL), Schema.create(avroType));
```

Review Comment: Let's rebase this one onto the new one you're adding.

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:

```scala
@@ -0,0 +1,238 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional.cdc
+
+import org.apache.avro.Schema
+import org.apache.avro.generic.IndexedRecord
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.{HoodieCommitMetadata, HoodieLogFile}
+import org.apache.hudi.common.table.cdc.{HoodieCDCSupplementalLoggingMode, HoodieCDCUtils}
+import org.apache.hudi.common.table.log.HoodieLogFormat
+import org.apache.hudi.common.table.log.block.{HoodieDataBlock, HoodieLogBlock}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.timeline.HoodieInstant
+import org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, recordsToStrings}
+import org.apache.hudi.config.{HoodieCleanConfig, HoodieWriteConfig}
+import org.apache.hudi.testutils.HoodieClientTestBase
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.SaveMode
+
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.CsvSource
+
+import scala.collection.JavaConversions._
+import scala.collection.JavaConverters._
+
+class TestCDCDataFrameSuite extends HoodieClientTestBase {
+
+  var spark: SparkSession = _
+
+  val commonOpts = Map(
+    HoodieTableConfig.CDC_ENABLED.key -> "true",
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    "hoodie.bulkinsert.shuffle.parallelism" -> "2",
+    "hoodie.delete.shuffle.parallelism" -> "1",
+    RECORDKEY_FIELD.key -> "_row_key",
+    PRECOMBINE_FIELD.key -> "timestamp",
+    HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+    HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1",
+    HoodieCleanConfig.AUTO_CLEAN.key -> "false"
+  )
+
+  @BeforeEach override def setUp(): Unit = {
+    setTableName("hoodie_test")
+    initPath()
+    initSparkContexts()
+    spark = sqlContext.sparkSession
+    initTestDataGenerator()
+    initFileSystem()
+  }
+
+  @AfterEach override def tearDown(): Unit = {
+    cleanupSparkContexts()
+    cleanupTestDataGenerator()
+    cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @CsvSource(Array("cdc_op_key", "cdc_data_before", "cdc_data_before_after"))
+  def testCOWDataSourceWrite(cdcSupplementalLoggingMode: String): Unit = {
+    val options = commonOpts ++ Map(
+      HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE.key -> cdcSupplementalLoggingMode
+    )
+
+    // Insert Operation
+    val records1 = recordsToStrings(dataGen.generateInserts("000", 100)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(options)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    // init meta client
+    metaClient = HoodieTableMetaClient.builder()
+      .setBasePath(basePath)
+      .setConf(spark.sessionState.newHadoopConf)
+      .build()
+    val instant1 = metaClient.reloadActiveTimeline.lastInstant().get()
+    assertEquals(spark.read.format("hudi").load(basePath).count(), 100)
+    // all the data is new-coming, it will write out cdc log files.
+    assertFalse(hasCDCLogFile(instant1))
+
+    val schemaResolver = new TableSchemaResolver(metaClient)
+    val dataSchema = schemaResolver.getTableAvroSchema(false)
+    val cdcSchema =
```
[jira] [Created] (HUDI-4892) Fix hudi-spark3-bundle
Ethan Guo created HUDI-4892:
-------------------------------

Summary: Fix hudi-spark3-bundle
Key: HUDI-4892
URL: https://issues.apache.org/jira/browse/HUDI-4892
Project: Apache Hudi
Issue Type: Bug
Reporter: Ethan Guo
Assignee: Ethan Guo
Fix For: 0.12.1

Using hudi-spark3-bundle with the Spark 3.3 shell, the following exception is thrown. Some classes are not packaged into the bundle.

{code:java}
scala> val df = spark.read.format("hudi").load("")
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.hudi.Spark32PlusDefaultSource not found
  at java.util.ServiceLoader.fail(ServiceLoader.java:239)
  at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
  ... 47 elided
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case
nsivabalan commented on PR #6498: URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254323578

@codope: Can you review this patch? I have overhauled the initial fix I put up, but it could result in a good perf improvement for cleaning. I have yet to write tests, but do take a look at my logic and let me know if it looks ok, or whether there is any case that I could be missing.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] CTTY commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores
CTTY commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r977051856

## rfc/rfc-56/rfc-56.md:

@@ -0,0 +1,226 @@

# RFC-56: Federated Storage Layer

## Proposers
- @umehrot2

## Approvers
- @vinoth
- @shivnarayan

## Status

JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)

## Abstract

As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is a potential of hitting request throttling limits, which in turn impacts performance. In this RFC, we propose to support an alternate storage layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and significantly reduces throttling.

In addition, we propose an interface that allows users to implement their own custom strategy to distribute data files across cloud stores, HDFS, or on-prem storage based on their specific use cases.

## Background

Apache Hudi follows the traditional Hive storage layout when writing files to storage:
- Partitioned tables: the files are distributed across multiple physical partition folders under the table's base path.
- Non-partitioned tables: the files are stored directly under the table's base path.

While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when working with cloud object stores like Amazon S3, because these stores [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits), but there can be a 30-60 minute wait before new partitions are created. Thus, keeping all files/objects under the same table path prefix can hit the request limits for that prefix, especially as workloads scale and several thousand files are written/updated concurrently. This hurts performance, because retrying failed requests reduces throughput, and it can cause occasional failures when the retries themselves continue to be throttled.

The traditional storage layout also tightly couples the partitions as folders under the table path. However, some users want the flexibility to distribute files/partitions under multiple different paths across cloud stores, HDFS, etc., based on their specific needs. For example, some customers want to distribute the files of each partition under a separate S3 bucket with its own encryption key. Such use cases cannot be implemented with Hudi currently.

The high-level proposal is to introduce a new storage layout strategy, where all files are distributed evenly across multiple randomly generated prefixes under the Amazon S3 bucket, instead of being stored under a common table path/prefix. This distributes requests evenly across the different prefixes, leading Amazon S3 to create internal partitions for the prefixes, each with its own request limit. This significantly reduces the possibility of hitting the request limit for a specific prefix/partition.

In addition, we want to expose an interface that gives users the flexibility to implement their own strategy for distributing files if neither the traditional Hive storage layout nor the federated storage layer proposed in this RFC meets their use case.

## Design

### Interface

```java
/**
 * Interface for providing storage file locations.
 */
public interface FederatedStorageStrategy extends Serializable {
  /**
   * Return a fully-qualified storage file location for the given filename.
   *
   * @param fileName data file name
   * @return a fully-qualified location URI for a data file
   */
  String storageLocation(String fileName);

  /**
   * Return a fully-qualified storage file location for the given partition and filename.
   *
   * @param partitionPath partition path for the file
   * @param fileName data file name
   * @return a fully-qualified location URI for a data file
   */
  String storageLocation(String partitionPath, String fileName);
}
```

### Generating file paths for the cloud-storage-optimized layout

We want to distribute files evenly across multiple random prefixes, instead of following the traditional Hive storage layout of keeping them under a common table path/prefix. In addition to the `Table Path`, for this new layout the user will configure another `Table Storage Path` under which the actual data files will be distributed. The original `Table Path` will be used to maintain the table/partitions Hudi metadata.

For the purpose of this documentation, let's assume:
```
Table Path => s3:

Table
```
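As a concrete (purely hypothetical) illustration of the `FederatedStorageStrategy` interface quoted above, the sketch below spreads files across a fixed set of numbered prefixes by hashing the file name. The class name, prefix count, and path format are assumptions made for illustration; they are not part of the RFC.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical strategy: hash each file name onto one of `numPrefixes`
// buckets so writes spread across many object-store prefixes instead of
// piling up under a single table prefix.
public class HashedPrefixStrategy {
  private final String storageBasePath; // e.g. the configured "Table Storage Path"
  private final int numPrefixes;

  public HashedPrefixStrategy(String storageBasePath, int numPrefixes) {
    this.storageBasePath = storageBasePath;
    this.numPrefixes = numPrefixes;
  }

  public String storageLocation(String fileName) {
    CRC32 crc = new CRC32();
    crc.update(fileName.getBytes(StandardCharsets.UTF_8));
    // Stable hash: the same file name always maps to the same prefix,
    // so every writer task computes the same location independently.
    long bucket = crc.getValue() % numPrefixes;
    return String.format("%s/%04d/%s", storageBasePath, bucket, fileName);
  }

  public String storageLocation(String partitionPath, String fileName) {
    // Keep the partition path inside the object key so listings stay meaningful.
    return storageLocation(partitionPath + "/" + fileName);
  }
}
```

A real implementation would additionally need to be wired in through writer configuration and remain deterministic across concurrent writers.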
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
alexeykudinkin commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r977042462

## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:

@@ -291,59 +284,51 @@ public void checkState() {
     }
   }

-  //
-  //
-  // NOTE: This method duplicates those ones of the HoodieRecordPayload and are placed here
-  //       for the duration of RFC-46 implementation, until migration off `HoodieRecordPayload`
-  //       is complete
-  //
-  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema readerSchema, Schema writerSchema) throws IOException;
+  /**
+   * Get column in record to support RDDCustomColumnsSortPartitioner
+   */
+  public abstract Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);

-  public abstract HoodieRecord rewriteRecord(Schema recordSchema, Schema targetSchema, TypedProperties props) throws IOException;
+  /**
+   * Support bootstrap.
+   */
+  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;

Review Comment:
Understood. Let's keep it for now, but just rename it to `joinWith` to avoid confusion
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
alexeykudinkin commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r977041457

## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkStructTypeSerializer.scala:

@@ -0,0 +1,157 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.hudi

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import com.twitter.chill.KSerializer
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import org.apache.avro.SchemaNormalization
import org.apache.commons.io.IOUtils
import org.apache.hudi.commmon.model.HoodieSparkRecord
import org.apache.spark.io.CompressionCodec
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils
import org.apache.spark.{SparkEnv, SparkException}
import scala.collection.mutable

/**
 * Custom serializer used for generic spark records. If the user registers the schemas
 * ahead of time, then the schema's fingerprint will be sent with each message instead of the actual
 * schema, as to reduce network IO.
 * Actions like parsing or compressing schemas are computationally expensive so the serializer
 * caches all previously seen values as to reduce the amount of work needed to do.
 * @param schemas a map where the keys are unique IDs for spark schemas and the values are the
 *                string representation of the Avro schema, used to decrease the amount of data
 *                that needs to be serialized.
 */
class SparkStructTypeSerializer(schemas: Map[Long, StructType]) extends KSerializer[HoodieSparkRecord] {

Review Comment:
https://hudi.apache.org/docs/quick-start-guide
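The fingerprinting idea described in the serializer's doc comment above can be sketched in isolation: instead of shipping the full schema with every record, the writer sends a small fingerprint and the reader resolves it from a pre-registered map. The class and method names below are invented for illustration and are unrelated to Hudi's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of fingerprint-based schema caching: registered schemas are
// replaced on the wire by their 8-byte fingerprint, cutting network IO.
public class SchemaFingerprintCache {
  private final Map<Long, String> knownSchemas = new HashMap<>();

  // Register a schema ahead of time under its fingerprint.
  public void register(long fingerprint, String schemaJson) {
    knownSchemas.put(fingerprint, schemaJson);
  }

  // Writer side: emit only the fingerprint when the schema was registered,
  // otherwise fall back to sending the full schema text.
  public Object encode(long fingerprint, String schemaJson) {
    return knownSchemas.containsKey(fingerprint) ? (Object) fingerprint : schemaJson;
  }

  // Reader side: resolve a fingerprint back to the registered schema.
  public String resolve(long fingerprint) {
    return knownSchemas.get(fingerprint);
  }
}
```

The real serializer also caches parse/compression results per fingerprint, which is the second point the doc comment makes.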
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
alexeykudinkin commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r977040996

## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkStructTypeSerializer.scala:

class SparkStructTypeSerializer(schemas: Map[Long, StructType]) extends KSerializer[HoodieSparkRecord] {

Review Comment:
Sorry, my bad, I wasn't clear enough -- we will have to:
- Implement a Registrar to make sure it registers our custom serializer
- Make sure we update the docs to include it (and highlight it in the change-log), similarly to how we recommend including the `spark.serializer` config
[GitHub] [hudi] yihua commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1
yihua commented on issue #6640: URL: https://github.com/apache/hudi/issues/6640#issuecomment-1254303659

@yuzhaojing @danny0405 Could any one of you chime in here?
[GitHub] [hudi] yihua commented on issue #6644: Hudi Multi Writer DynamoDBBasedLocking issue
yihua commented on issue #6644: URL: https://github.com/apache/hudi/issues/6644#issuecomment-1254302236

@koochiswathiTR Thanks for raising this! The config naming of `partition_key` is confusing to newcomers. Here's what you need to do:

(1) As @xushiyan already mentioned, you don't need to set the credentials in env variables if the instance or service is already granted access with the proper roles;
(2) By default, `hoodie.write.lock.dynamodb.partition_key` is set to the table name, so that multiple writers writing to the same table share the same lock. If you customize the name, make sure it's the same for all the writers;
(3) Note that what `hoodie.write.lock.dynamodb.partition_key` specifies is actually the value to use for the column, not the column name itself. The column name is fixed to `key` in the DynamoDB table;
(4) The DynamoDB table for locking purposes is automatically created by the Hudi code, so you don't have to create the table yourself. If you do so, make sure that the `key` column is present in the table, not `lock` or the value specified by `hoodie.write.lock.dynamodb.partition_key`.

Let me know if this solves your problem. Feel free to close it once all good.
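For reference, a minimal multi-writer lock configuration following the points above might look like the fragment below. The property names come from Hudi's concurrency-control documentation; the table name, partition-key value, and region are placeholders to adapt to your setup.

```properties
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
# DynamoDB table used for locks; created automatically if it does not exist
hoodie.write.lock.dynamodb.table=hudi-locks
# Value stored in the fixed `key` column -- must be identical across all writers of this table
hoodie.write.lock.dynamodb.partition_key=my_hudi_table
hoodie.write.lock.dynamodb.region=us-east-1
hoodie.write.lock.dynamodb.billing_mode=PAY_PER_REQUEST
```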
[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
hudi-bot commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254300684

## CI report:

* 375927ade5b4b327e44ebc227fb57e64de524fcc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3426)
* e1cf530fbae41de33cb9cc76a16a2e6dc5425837 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11560)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
hudi-bot commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254296165

## CI report:

* 375927ade5b4b327e44ebc227fb57e64de524fcc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3426)
* e1cf530fbae41de33cb9cc76a16a2e6dc5425837 UNKNOWN
[GitHub] [hudi] bhasudha commented on a diff in pull request #6638: [DOCS] Add tags to blog pages
bhasudha commented on code in PR #6638: URL: https://github.com/apache/hudi/pull/6638#discussion_r977030742

## README.md:

@@ -156,6 +156,44 @@ Example: When you change any file in `versioned_docs/version-0.7.0/`, it will on

## Configs
Configs can be automatically updated by following these steps documented at ../hudi-utils/README.md

## Blogs

When adding a new blog, please follow these guidelines.

1. Every blog should have the `title`, `authors`, `image`, and `tags` in the metadata of the blog. For example, the front matter for a blog should look like below.
```
---
title: "Blog title"
author: FirstName LastName
category: blog
image: /assets/images/blog/
tags:
- how-to
- deltastreamer
- incremental-processing
- apache hudi
---
```
2. The blog can be inline or can refer to an external blog. If it is an inline blog, please save it as a `.md` file. Example of an inline blog - [Build Open Lakehouse using Apache Hudi & dbt](https://github.com/apache/hudi/blob/asf-site/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md). If the blog refers to an external blog, you would need to embed the redirect URL and save it as a `.mdx` file. Take a look at this blog for reference - [Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison](https://raw.githubusercontent.com/apache/hudi/asf-site/website/blog/2022-08-18-Apache-Hudi-vs-Delta-Lake-vs-Apache-Iceberg-Lakehouse-Feature-Comparison.mdx)
3. The image must be uploaded under /assets/images/blog/ and should be of standard size 1200 x 600.
4. The tags should be representative of these
   1. tag1
      - how-to (tutorial, recipes, show case how to use feature x)

Review Comment:
ah yes. I'll add it in a followup PR
[hudi] branch asf-site updated: [DOCS] Add tags to blog pages (#6638)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 12ebe2bdef [DOCS] Add tags to blog pages (#6638) 12ebe2bdef is described below commit 12ebe2bdef369cf7eb80cb2767e88fbbcb4f10d6 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Wed Sep 21 15:18:50 2022 -0700 [DOCS] Add tags to blog pages (#6638) --- README.md | 38 ++ ...e-Case-for-incremental-processing-on-Hadoop.mdx | 4 +++ ...-Incremental-Processing-Framework-on-Hadoop.mdx | 4 +++ .../blog/2019-05-14-registering-dataset-to-hive.md | 3 ++ .../blog/2019-09-09-ingesting-database-changes.md | 3 ++ website/blog/2019-10-22-Hudi-On-Hops.mdx | 3 ++ ...-Data-on-S3-with-Amazon-EMR-and-Apache-Hudi.mdx | 3 ++ website/blog/2020-01-15-delete-support-in-hudi.md | 4 +++ .../blog/2020-01-20-change-capture-using-aws.md| 5 +++ website/blog/2020-03-22-exporting-hudi-datasets.md | 4 +++ .../blog/2020-04-27-apache-hudi-apache-zepplin.md | 4 +++ ...0-05-28-monitoring-hudi-metrics-with-datadog.md | 4 +++ ...nnounces-Apache-Hudi-as-a-Top-Level-Project.mdx | 3 ++ ...ctional-Data-Lake-at-Uber-Using-Apache-Hudi.mdx | 5 +++ ...-Apache-Hudi-grows-cloud-data-lake-maturity.mdx | 3 ++ .../blog/2020-08-04-PrestoDB-and-Apache-Hudi.mdx | 3 ++ ...18-hudi-incremental-processing-on-data-lakes.md | 5 +++ ...-efficient-migration-of-large-parquet-tables.md | 5 +++ ...2020-08-21-async-compaction-deployment-model.md | 4 +++ ...2020-08-22-ingest-multiple-tables-using-hudi.md | 4 +++ ...020-10-06-cdc-solution-using-hudi-by-nclouds.md | 4 +++ .../2020-10-15-apache-hudi-meets-apache-flink.md | 4 +++ .../2020-10-19-Origins-of-Data-Lake-at-Grofers.mdx | 6 .../2020-10-19-hudi-meets-aws-emr-and-aws-dms.md | 3 ++ ...Enterprise-at-Data-Summit-Connect-Fall-2020.mdx | 3 ++ ...apture-using-Apache-Hudi-and-Amazon-AMS-EMR.mdx | 5 +++ 
.../blog/2020-11-11-hudi-indexing-mechanisms.md| 4 +++ ...-11-29-Can-Big-Data-Solutions-Be-Affordable.mdx | 5 +++ ...gh-perf-data-lake-with-hudi-and-alluxio-t3go.md | 6 website/blog/2021-01-27-hudi-clustering-intro.md | 4 +++ website/blog/2021-02-13-hudi-key-generators.md | 4 +++ ...ravel-operations-in-Hopsworks-Feature-Store.mdx | 6 ...-Generation-of-Data-Lakes-using-Apache-Hudi.mdx | 4 +++ website/blog/2021-03-01-hudi-file-sizing.md| 4 +++ ...-stream-for-amazon-dynamodb-and-apache-hudi.mdx | 4 +++ ...New-features-from-Apache-hudi-in-Amazon-EMR.mdx | 3 ++ ...-Apache-Spark-and-Apache-Hudi-on-Amazon-EMR.mdx | 4 +++ .../2021-05-12-Experts-primer-on-Apache-Hudi.mdx | 3 ++ ...ow-Uber-gets-data-a-ride-to-its-destination.mdx | 3 ++ ...loying-right-configurations-for-hudi-cleaner.md | 6 +++- ...6-Amazon-Athena-expands-Apache-Hudi-support.mdx | 3 ++ ...e-with-amazon-athena-Read-optimized-queries.mdx | 4 +++ .../2021-07-21-streaming-data-lake-platform.md | 4 +++ ...-lake-evolution-scheme-based-on-Apache-Hudi.mdx | 5 +++ ...ars-Versioned-Feature-Data-with-a-Lakehouse.mdx | 7 ...cient-Open-Source-Big-Data-Platform-at-Uber.mdx | 7 .../blog/2021-08-16-kafka-custom-deserializer.md | 6 .../blog/2021-08-18-improving-marker-mechanism.md | 5 +++ website/blog/2021-08-18-virtual-keys.md| 4 +++ website/blog/2021-08-23-async-clustering.md| 4 +++ website/blog/2021-08-23-s3-events-source.md| 4 +++ ...g-eb-level-data-lake-using-hudi-at-bytedance.md | 3 ++ .../blog/2021-10-05-Data-Platform-2.0-Part-I.mdx | 5 +++ ...abyte-scale-using-AWS-Glue-with-Apache-Hudi.mdx | 5 +++ ...n-building-real-time-data-lake-at-station-B.mdx | 4 +++ ...-at-enterprise-scale-using-the-AWS-platform.mdx | 4 +++ ...-Hudi-Architecture-Tools-and-Best-Practices.mdx | 3 ++ ...se-concurrency-control-are-we-too-optimistic.md | 4 +++ ...udi-0.7.0-and-0.8.0-available-on-Amazon-EMR.mdx | 3 ++ ...hudi-zorder-and-hilbert-space-filling-curves.md | 5 +++ ...es-with-Apache-Hudi-Kafka-Hive-and-Debezium.mdx | 4 +++ 
...2022-01-06-apache-hudi-2021-a-year-in-review.md | 4 +++ ...e-data-capture-with-debezium-and-apache-hudi.md | 7 +++- ...nd-How-I-Integrated-Airbyte-and-Apache-Hudi.mdx | 4 +++ ...-lake-efforts-at-Walmart-and-Disney-Hotstar.mdx | 3 ++ ...st-Efficiency-Scale-in-Big-Data-File-Format.mdx | 6 .../2022-02-02-Onehouse-Commitment-to-Openness.mdx | 4 +++ ...gs-a-fully-managed-lakehouse-to-Apache-Hudi.mdx | 4 +++ ...-transformations-on-Distributed-file-system.mdx | 3 ++ ...ating-Current-Interest-and-Rate-of-Adoption.mdx | 6 .../2022-02-17-Fresher-Data-Lake-on-AWS-S3.mdx | 4 +++ ...s-core-concepts-from-hudi-persistence-files.mdx | 4 +++
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6638: [DOCS] Add tags to blog pages
nsivabalan commented on code in PR #6638: URL: https://github.com/apache/hudi/pull/6638#discussion_r976869703

## README.md:

4. The tags should be representative of these
   1. tag1
      - how-to (tutorial, recipes, show case how to use feature x)

Review Comment:
guess we might need to add `blog` as one of the values.
[GitHub] [hudi] nsivabalan merged pull request #6638: [DOCS] Add tags to blog pages
nsivabalan merged PR #6638: URL: https://github.com/apache/hudi/pull/6638
[GitHub] [hudi] yihua commented on issue #6398: [SUPPORT] Metadata table thows hbase exceptions
yihua commented on issue #6398: URL: https://github.com/apache/hudi/issues/6398#issuecomment-1254287346

> @yihua yes this parameter is placed in separate hbase-site.xml which is used by spark.

Thanks for the confirmation! I'll also list this as a workaround in our FAQ.
[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
nsivabalan commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254285358

I have pushed out a commit myself to address the feedback. Yet to see if we can cover the fix with a test.
[GitHub] [hudi] yihua closed issue #6658: [SUPPORT] undrop table
yihua closed issue #6658: [SUPPORT] undrop table URL: https://github.com/apache/hudi/issues/6658
[GitHub] [hudi] yihua commented on issue #6658: [SUPPORT] undrop table
yihua commented on issue #6658: URL: https://github.com/apache/hudi/issues/6658#issuecomment-1254283326

@melin Thank you for raising this feature request! I created a Jira ticket to track the work and let's follow up there: HUDI-4891. Closing this support ticket.
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark
alexeykudinkin commented on code in PR #6516: URL: https://github.com/apache/hudi/pull/6516#discussion_r977025908

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

@@ -665,13 +671,21 @@ public final Stream<FileSlice> getLatestFileSlicesBeforeOrOn(String partitionStr
     readLock.lock();
     String partitionPath = formatPartitionKey(partitionStr);
     ensurePartitionLoadedCorrectly(partitionPath);
-    Stream<FileSlice> fileSliceStream = fetchLatestFileSlicesBeforeOrOn(partitionPath, maxCommitTime)
-        .filter(slice -> !isFileGroupReplacedBeforeOrOn(slice.getFileGroupId(), maxCommitTime));
+    Stream<Stream<FileSlice>> allFileSliceStream = fetchAllStoredFileGroups(partitionPath)

Review Comment: `Stream<Stream<FileSlice>>` doesn't make sense, let's flat-map it
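The flat-map the reviewer asks for collapses the nested stream into a single one. A minimal sketch, using plain strings in place of Hudi's `FileSlice` (the class and values below are illustrative only):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapDemo {
    public static void main(String[] args) {
        // Nested stream, analogous to Stream<Stream<FileSlice>> in the review:
        // each inner stream holds the file slices of one file group.
        Stream<Stream<String>> nested = Stream.of(
                Stream.of("slice-a", "slice-b"),
                Stream.of("slice-c"));

        // flatMap collapses the nesting into a single flat stream.
        List<String> flat = nested.flatMap(s -> s).collect(Collectors.toList());
        System.out.println(flat); // prints [slice-a, slice-b, slice-c]
    }
}
```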
[jira] [Updated] (HUDI-4891) Support UNDROP TABLE in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-4891:
----------------------------
    Description:
Specifies the identifier for the table to restore. If the identifier contains spaces or special characters, the entire string must be enclosed in double quotes. Identifiers enclosed in double quotes are also case-sensitive.
# Restoring tables is only supported in the current schema or current database, even if the table name is fully-qualified.
# If a table with the same name already exists, an error is returned.
# UNDROP relies on the Snowflake Time Travel feature. An object can be restored only if the object was deleted within the data retention period. The default value is 24 hours.

[https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html]

> Support UNDROP TABLE in Spark SQL
> ---------------------------------
>
>                 Key: HUDI-4891
>                 URL: https://issues.apache.org/jira/browse/HUDI-4891
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4891) Support UNDROP TABLE in Spark SQL
Ethan Guo created HUDI-4891:
-------------------------------
             Summary: Support UNDROP TABLE in Spark SQL
                 Key: HUDI-4891
                 URL: https://issues.apache.org/jira/browse/HUDI-4891
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Ethan Guo
[jira] [Updated] (HUDI-4891) Support UNDROP TABLE in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-4891:
----------------------------
    Fix Version/s: 1.0.0

> Support UNDROP TABLE in Spark SQL
> ---------------------------------
>
>                 Key: HUDI-4891
>                 URL: https://issues.apache.org/jira/browse/HUDI-4891
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 1.0.0
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on code in PR #6046: URL: https://github.com/apache/hudi/pull/6046#discussion_r977019700

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:

@@ -275,6 +345,66 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContext
         .map(record -> transform(record, writeConfig)));
   }

+  /**
+   * Get dataset of all records for the group. This includes all records from the file slice (apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                HoodieClusteringGroup clusteringGroup,
+                                                String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    Path[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .map(Path::new)
+        .toArray(Path[]::new);
+
+    HashMap<String, String> params = new HashMap<>();
+    params.put("hoodie.datasource.query.type", "snapshot");
+    params.put("as.of.instant", instantTime);
+
+    Path[] paths;
+    if (hasLogFiles) {
+      String compactionFraction = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      params.put("compaction.memory.fraction", compactionFraction);
+
+      Path[] deltaPaths = clusteringOps
+          .stream()
+          .filter(op -> !op.getDeltaFilePaths().isEmpty())
+          .flatMap(op -> op.getDeltaFilePaths().stream())
+          .map(Path::new)
+          .toArray(Path[]::new);
+      paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+    } else {
+      paths = baseFilePaths;
+    }
+
+    String readPathString = String.join(",", Arrays.stream(paths).map(Path::toString).toArray(String[]::new));
+    params.put("hoodie.datasource.read.paths", readPathString);
+    // Building HoodieFileIndex needs this param to decide the query path
+    params.put("glob.paths", readPathString);
+
+    // Let Hudi relations fetch the schema from the table itself
+    BaseRelation relation = SparkAdapterSupport$.MODULE$.sparkAdapter()

Review Comment: :+1:
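The path assembly in the diff reduces to combining two groups of paths and comma-joining them. A self-contained sketch with hypothetical paths (the class name and path values are illustrative only, not Hudi's):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReadPathsDemo {
    public static void main(String[] args) {
        // Illustrative stand-ins for base-file and delta (log) file paths
        // collected from the clustering operations.
        List<String> basePaths = List.of("/tbl/p1/base-1.parquet", "/tbl/p1/base-2.parquet");
        List<String> deltaPaths = List.of("/tbl/p1/.log-1");

        // Combine both groups and join into the comma-separated string passed
        // as a read-paths option, mirroring the snippet above.
        String readPathString = Stream.concat(basePaths.stream(), deltaPaths.stream())
                .collect(Collectors.joining(","));
        System.out.println(readPathString);
        // prints /tbl/p1/base-1.parquet,/tbl/p1/base-2.parquet,/tbl/p1/.log-1
    }
}
```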
[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275167 @boneanxs thank you very much for iterating on this one! Truly monumental effort!
[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275527 Did you try to re-run your benchmark after the changes we've made? If so, can you please paste the results here?
[GitHub] [hudi] yihua commented on issue #6686: Apache Hudi Consistency issues with glue and marketplace connector
yihua commented on issue #6686: URL: https://github.com/apache/hudi/issues/6686#issuecomment-1254272868

@asankadarshana007 The consistency check, when enabled, happens when removing invalid data files: (1) check that all paths to delete exist, (2) delete them, (3) wait for all paths to disappear after eventual consistency. Note that this logic is not needed for strong consistency. As the invalid data files are now determined based on the markers, there could be a case where a marker is created but the data file has not started being written, so that check (1) fails, which is okay. Given that there is no use case for eventual consistency at the moment, we don't maintain the logic. Let me know if turning off `hoodie.consistency.check.enabled` solves your problem. You can close the ticket if all is good.

```
if (!invalidDataPaths.isEmpty()) {
  LOG.info("Removing duplicate data files created due to task retries before committing. Paths=" + invalidDataPaths);
  Map<String, List<Pair<String, String>>> invalidPathsByPartition = invalidDataPaths.stream()
      .map(dp -> Pair.of(new Path(basePath, dp).getParent().toString(), new Path(basePath, dp).toString()))
      .collect(Collectors.groupingBy(Pair::getKey));

  // Ensure all files in the delete list are actually present. This is mandatory for an eventually consistent FS.
  // Otherwise, we may miss deleting such files. If files are not found even after retries, fail the commit
  if (consistencyCheckEnabled) {
    // This will ensure all files to be deleted are present.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.APPEAR);
  }

  // Now delete partially written files
  context.setJobStatus(this.getClass().getSimpleName(), "Delete all partially written files: " + config.getTableName());
  deleteInvalidFilesByPartitions(context, invalidPathsByPartition);

  // Now ensure the deleted files disappear
  if (consistencyCheckEnabled) {
    // This will ensure all files to be deleted are absent.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.DISAPPEAR);
  }
}
```
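The two `waitForAllFiles` calls amount to polling a visibility condition over all paths until it holds for every path or retries run out. A hedged sketch of that pattern; the class, method names, and the toy predicate below are illustrative, not Hudi's actual API:

```java
import java.util.List;
import java.util.function.Predicate;

public class VisibilityWaitDemo {
    // Poll until every path satisfies the visibility check (e.g. exists for
    // APPEAR, absent for DISAPPEAR) or the retry budget is exhausted.
    static boolean waitForAll(List<String> paths, Predicate<String> visible,
                              int maxRetries, long sleepMs) throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (paths.stream().allMatch(visible)) {
                return true;
            }
            Thread.sleep(sleepMs);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy check: a path counts as "visible" if it is non-empty.
        boolean ok = waitForAll(List.of("f1", "f2"), p -> !p.isEmpty(), 3, 10);
        System.out.println(ok); // prints true
    }
}
```

On a strongly consistent store the first poll already succeeds, which is why the comment above notes the check is unnecessary there.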
[jira] [Commented] (HUDI-3796) Implement layout to filter out uncommitted log files without reading the log blocks
[ https://issues.apache.org/jira/browse/HUDI-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607985#comment-17607985 ]

sivabalan narayanan commented on HUDI-3796:
-------------------------------------------
Changing the name of the log file is a pretty big change. Don't think we can get it into 0.12.1. Punting this for now.

> Implement layout to filter out uncommitted log files without reading the log blocks
> -----------------------------------------------------------------------------------
>
>                 Key: HUDI-3796
>                 URL: https://issues.apache.org/jira/browse/HUDI-3796
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Ethan Guo
>            Assignee: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.12.1
>
> Related: HUDI-3637
> At a high level, getLatestFileSlices() is going to fetch the latest file slices for committed base files and filter out any file slices with the uncommitted base instant time. The uncommitted log files in the latest file slices may be included, and they are skipped while doing log reading and merging, i.e., the logic in "AbstractHoodieLogRecordReader".
> We can use log instant time instead of base instant time for the log file name so that it is able to filter out uncommitted log files without reading the log blocks beforehand.
[jira] [Updated] (HUDI-3796) Implement layout to filter out uncommitted log files without reading the log blocks
[ https://issues.apache.org/jira/browse/HUDI-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-3796:
--------------------------------------
    Sprint:   (was: 2022/09/19)

> Implement layout to filter out uncommitted log files without reading the log blocks
> -----------------------------------------------------------------------------------
>
>                 Key: HUDI-3796
>                 URL: https://issues.apache.org/jira/browse/HUDI-3796
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Ethan Guo
>            Assignee: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.12.1