[GitHub] [hudi] danny0405 commented on issue #8148: [SUPPORT]
danny0405 commented on issue #8148: URL: https://github.com/apache/hudi/issues/8148#issuecomment-1463408930 That's a nice analysis @kkrugler, let's see if we can solve this in an elegant way! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-5917) MOR table log file has only one replication
[ https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandy du updated HUDI-5917: --- Description: When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have a replication factor of only 1. (was: When mor talbe enable HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true` ,log file in hdfs only has one replication.) > MOR table log file has only one replication > --- > > Key: HUDI-5917 > URL: https://issues.apache.org/jira/browse/HUDI-5917 > Project: Apache Hudi > Issue Type: Bug >Reporter: sandy du >Priority: Major > Labels: pull-request-available > > When a MOR table enables HoodieRetryWrapperFileSystem through the configuration > `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have a > replication factor of only 1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
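A plausible mechanism for this bug (hedged, based on general Hadoop semantics rather than Hudi's actual source): Hadoop's base `FileSystem#getDefaultReplication` returns 1, so a wrapping file system that forgets to delegate that call silently downgrades new files to a single replica. The sketch below models the delegation pattern with toy classes; `BaseFileSystem`, `BuggyRetryWrapper`, and `FixedRetryWrapper` are illustrative names, not Hudi's or Hadoop's real ones.

```java
// Minimal illustration of the wrapper-delegation bug pattern behind HUDI-5917.
// All class names are hypothetical; Hadoop's real FileSystem API differs.
public class RetryWrapperDemo {

    // Stand-in for the Hadoop FileSystem base class, whose default
    // replication is 1 unless a subclass overrides it.
    static class BaseFileSystem {
        public short getDefaultReplication() {
            return 1;
        }
    }

    // A concrete HDFS-like file system configured with replication 3.
    static class HdfsLikeFileSystem extends BaseFileSystem {
        @Override
        public short getDefaultReplication() {
            return 3;
        }
    }

    // Buggy wrapper: does NOT override getDefaultReplication, so it
    // inherits the base default (1) instead of asking the wrapped FS.
    static class BuggyRetryWrapper extends BaseFileSystem {
        protected final BaseFileSystem wrapped;

        BuggyRetryWrapper(BaseFileSystem wrapped) {
            this.wrapped = wrapped;
        }
    }

    // Fixed wrapper: forwards the metadata query to the wrapped FS.
    static class FixedRetryWrapper extends BuggyRetryWrapper {
        FixedRetryWrapper(BaseFileSystem wrapped) {
            super(wrapped);
        }

        @Override
        public short getDefaultReplication() {
            return wrapped.getDefaultReplication();
        }
    }

    public static short buggyReplication() {
        return new BuggyRetryWrapper(new HdfsLikeFileSystem()).getDefaultReplication();
    }

    public static short fixedReplication() {
        return new FixedRetryWrapper(new HdfsLikeFileSystem()).getDefaultReplication();
    }

    public static void main(String[] args) {
        System.out.println("buggy wrapper replication = " + buggyReplication());  // 1
        System.out.println("fixed wrapper replication = " + fixedReplication());  // 3
    }
}
```

This is why the symptom is exactly "replication factor 1": the retry wrapper answers the replication query itself instead of delegating it to the underlying HDFS client.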
[jira] [Updated] (HUDI-5917) MOR table log file has only one replication
[ https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandy du updated HUDI-5917: --- Summary: MOR table log file has only one replication (was: MOR Table Log file has only one replication) > MOR table log file has only one replication > --- > > Key: HUDI-5917 > URL: https://issues.apache.org/jira/browse/HUDI-5917 > Project: Apache Hudi > Issue Type: Bug >Reporter: sandy du >Priority: Major > Labels: pull-request-available > > When a MOR table enables HoodieRetryWrapperFileSystem through the configuration > `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have a > replication factor of only 1.
[jira] [Updated] (HUDI-5740) Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty
[ https://issues.apache.org/jira/browse/HUDI-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5740: - Labels: pull-request-available (was: ) > Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty > -- > > Key: HUDI-5740 > URL: https://issues.apache.org/jira/browse/HUDI-5740 > Project: Apache Hudi > Issue Type: Improvement > Components: configs, deltastreamer >Reporter: Jonathan Vexler >Assignee: Lokesh Jain >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > > The configs in the following classes are not implemented using HoodieConfig, > which makes it impossible to surface them on the Configurations page. We need to > refactor the code so that each config property is implemented as a > ConfigProperty in a corresponding new HoodieConfig class. Refer to > HoodieArchivalConfig for an existing implementation. > > InitialCheckPointProvider > HoodieDeltaStreamer > HoodieMultiTableDeltaStreamer > FilebasedSchemaProvider > HiveSchemaProvider > JdbcbasedSchemaProvider > ProtoClassBasedSchemaProvider > SchemaPostProcessor > SchemaRegistryProvider > SparkAvroPostProcessor > DropColumnSchemaPostProcessor > BaseSchemaPostProcessorConfig > KafkaOffsetPostProcessor > SanitizationUtils > Also 'hoodie.deltastreamer.multiwriter.source.checkpoint.id' in > HoodieWriteConfig
[GitHub] [hudi] lokeshj1703 opened a new pull request, #8152: [HUDI-5740] Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty
lokeshj1703 opened a new pull request, #8152: URL: https://github.com/apache/hudi/pull/8152 ### Change Logs The configs in the following classes are not implemented using HoodieConfig, which makes it impossible to surface them on the Configurations page. We need to refactor the code so that each config property is implemented as a ConfigProperty in a corresponding new HoodieConfig class. Refer to HoodieArchivalConfig for an existing implementation. InitialCheckPointProvider HoodieDeltaStreamer HoodieMultiTableDeltaStreamer FilebasedSchemaProvider HiveSchemaProvider JdbcbasedSchemaProvider ProtoClassBasedSchemaProvider SchemaPostProcessor SchemaRegistryProvider SparkAvroPostProcessor DropColumnSchemaPostProcessor BaseSchemaPostProcessorConfig KafkaOffsetPostProcessor SanitizationUtils Also 'hoodie.deltastreamer.multiwriter.source.checkpoint.id' in HoodieWriteConfig ### Impact NA ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
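For readers unfamiliar with the pattern this ticket asks for, here is a minimal, self-contained sketch of a ConfigProperty-style declaration. The `ConfigProperty` class below is a simplified stand-in for Hudi's real one in hudi-common, and the key name follows the deltastreamer schema-provider convention but should be treated as illustrative, not authoritative.

```java
import java.util.Properties;

// Simplified sketch of the ConfigProperty/HoodieConfig pattern; the real
// Hudi classes differ in detail and live in hudi-common.
public class SchemaProviderConfigDemo {

    // Typed key + default + documentation, declared in one place so a docs
    // generator can surface the config on the Configurations page.
    static final class ConfigProperty<T> {
        final String key;
        final T defaultValue;
        String doc = "";

        ConfigProperty(String key, T defaultValue) {
            this.key = key;
            this.defaultValue = defaultValue;
        }

        ConfigProperty<T> withDocumentation(String doc) {
            this.doc = doc;
            return this;
        }
    }

    // Example: a config that FilebasedSchemaProvider might declare
    // (key name is illustrative).
    static final ConfigProperty<String> SOURCE_SCHEMA_FILE =
        new ConfigProperty<>("hoodie.deltastreamer.schemaprovider.source.schema.file", "")
            .withDocumentation("Path of the file holding the source Avro schema.");

    // Resolve a property against user-supplied props, falling back to default.
    static <T> String getString(Properties props, ConfigProperty<T> cfg) {
        return props.getProperty(cfg.key, String.valueOf(cfg.defaultValue));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(getString(props, SOURCE_SCHEMA_FILE)); // default: ""
        props.setProperty(SOURCE_SCHEMA_FILE.key, "/schemas/source.avsc");
        System.out.println(getString(props, SOURCE_SCHEMA_FILE));
    }
}
```

The point of the refactor is that the key, default, and documentation travel together in one declaration, instead of bare string keys scattered through the provider classes.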
[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication
hudi-bot commented on PR #8150: URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463383059 ## CI report: * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN * 6dc5a2866114879b660baceae026bf8574126af3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15652) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio
hudi-bot commented on PR #8149: URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463383019 ## CI report: * b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650) * 64fff59128deb511ed29c4ac7972345e6dab1bd7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15653) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication
hudi-bot commented on PR #8150: URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463376825 ## CI report: * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN * 6dc5a2866114879b660baceae026bf8574126af3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio
hudi-bot commented on PR #8149: URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463376789 ## CI report: * b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650) * 64fff59128deb511ed29c4ac7972345e6dab1bd7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463376698 ## CI report: * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN * a690c5122694914f975ebbb717e06630ac3b5902 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646) * a9f08395c3578b1567ec34ed61fb34acc219aa28 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] MrAladdin opened a new issue, #8151: [SUPPORT]org.apache.hudi.exception.HoodieCompactionException: Could not compact /.hoodie/metadata
MrAladdin opened a new issue, #8151: URL: https://github.com/apache/hudi/issues/8151 **Describe the problem you faced** hudi metadata table : compaction exception authenticated : hudi 0.12.2 ok hudi 0.13.0 compaction exception **Expected behavior** org.apache.hudi.exception.HoodieCompactionException: Could not compact /.hoodie/metadata **Environment Description** * Hudi version :0.13.0 * Spark version :3.3.1 * Hive version :3.1.2 * Hadoop version :3.1.3 * Storage (HDFS/S3/GCS..) :HDFS * Running on Docker? (yes/no) :no **Additional context** .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) .option("hoodie.index.type", "BUCKET") .option("hoodie.index.bucket.engine", "CONSISTENT_HASHING") **Stacktrace** Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) at org.apache.spark.rdd.RDD.collect(RDD.scala:1020) at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362) at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361) at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45) at org.apache.hudi.data.HoodieJavaRDD.collectAsList(HoodieJavaRDD.java:163) at org.apache.hudi.table.action.compact.RunCompactionActionExecutor.execute(RunCompactionActionExecutor.java:101) ... 
66 more Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:376) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:198) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:114) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:73) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:464) at org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:204) at org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$9cd4b1be$1(HoodieCompactor.java:129) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463368031 ## CI report: * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN * a690c5122694914f975ebbb717e06630ac3b5902 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication
hudi-bot commented on PR #8150: URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463368135 ## CI report: * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?
danny0405 commented on issue #8071: URL: https://github.com/apache/hudi/issues/8071#issuecomment-1463363633 Thanks. For a COW table with the insert operation, Flink does not use any index, so the bucket index does not apply and the write throughput should be high. For UPSERTs with the bucket index on a COW table, yes, the performance is poor because almost the whole table/partition is rewritten on each checkpoint.
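The write-amplification point above can be made concrete with a back-of-envelope calculation. All numbers below are invented for illustration; only the ratio matters.

```java
// Back-of-envelope illustration of COW upsert write amplification.
// Every number here is made up; only the ratio is the point.
public class CowWriteAmplification {

    // COW: when updates touch every file group, each touched base file
    // is rewritten in full on the next checkpoint/commit.
    static long cowBytesWritten(long fileGroups, long baseFileBytes) {
        return fileGroups * baseFileBytes;
    }

    // MOR: only the incoming records are appended to log files.
    static long morBytesWritten(long updatedRecords, long recordBytes) {
        return updatedRecords * recordBytes;
    }

    public static void main(String[] args) {
        long fileGroups = 100;            // bucket index: fixed number of buckets
        long baseFileBytes = 120L << 20;  // ~120 MiB per base file
        long updatedRecords = 10_000;     // small trickle of updates per checkpoint
        long recordBytes = 1_000;         // ~1 KB per record

        long cow = cowBytesWritten(fileGroups, baseFileBytes);
        long mor = morBytesWritten(updatedRecords, recordBytes);
        System.out.println("COW bytes per checkpoint: " + cow); // ~12 GiB rewritten
        System.out.println("MOR bytes per checkpoint: " + mor); // ~10 MB appended
        System.out.println("amplification: " + (cow / mor) + "x");
    }
}
```

With a scattered trickle of updates, COW rewrites gigabytes to persist megabytes, which is exactly why MOR (or avoiding per-checkpoint upserts) is preferred for this workload.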
[GitHub] [hudi] danny0405 commented on a diff in pull request #8088: [HUDI-5873] The pending compactions of dataset table should not block…
danny0405 commented on code in PR #8088: URL: https://github.com/apache/hudi/pull/8088#discussion_r1132010791 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1029,23 +1029,63 @@ protected HoodieData prepRecords(MapCases to be handled: + * + * We cannot perform compaction if there are previous inflight operations on the dataset. This is because + * a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx; + * In multi-writer scenario, a parallel operation with a greater instantTime may have completed creating a + * deltacommit. + * */ protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String instantTime) { // finish off any pending compactions if any from previous attempt. writeClient.runAnyPendingCompactions(); -String latestDeltaCommitTimeInMetadataTable = metadataMetaClient.reloadActiveTimeline() +HoodieTimeline metadataCompletedDeltaCommitTimeline = metadataMetaClient.reloadActiveTimeline() .getDeltaCommitTimeline() -.filterCompletedInstants() +.filterCompletedInstants(); +String latestDeltaCommitTimeInMetadataTable = metadataCompletedDeltaCommitTimeline .lastInstant().orElseThrow(() -> new HoodieMetadataException("No completed deltacommit in metadata table")) .getTimestamp(); -List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested() +Set metadataCompletedDeltaCommits = metadataCompletedDeltaCommitTimeline.getInstantsAsStream() +.map(HoodieInstant::getTimestamp) +.collect(Collectors.toSet()); +// pending compactions in DT should not block the compaction of MDT. 
+// a pending compaction on the DT(for MOR table, this is a common case) +// could cause the MDT compaction not been triggered in time, +// the slow compaction progress of MDT can further affect the timeline archiving of DT, +// which would result in both timelines from DT and MDT can not be archived timely, +// that is how the small file issues from both the DT and MDT timelines emerge. + +// why we could filter out the compaction commit that has not been committed into the MDT? + +// there are 2 preconditions that need to address first: +// 1. only the write commits (commit, delta_commit, replace_commit) can trigger the MDT compaction; +// 2. the MDT is always committed before the DT. + +// there are 3 cases we want to analyze for a compaction instant from DT: +// 1. both the DT and MDT does not commit the instant; +//1.1 the compaction in DT is normal, it just lags long time to finish; +//1.2 some error happens to the compaction procedure. +// 2. the MDT committed the compaction instant, while the DT hadn't; +//2.1 the job crashed suddenly while the compactor tries to commit to the DT right after the MDT has been committed; +//2.2 the job has been canceled manually right after the MDT has been committed. +// 3. both the DT and MDT commit the instant. + +// the 3rd case should be okay, now let's analyze the first 2 cases: +// +// the 1st case: if the instant has not been committed yet, the compaction of MDT would just ignore the instant, +// so the pending instant can not be compacted into the HFile, the instant should also not be archived by both of the DT and the MDT(that is how the archival mechanism works), +// the log reader of MDT would ignore the instant correctly, the result view should work! 
+ +// the 2nd case: we can not trigger compact, because once the MDT triggers, the MDT archiver can then archive the instant, but this instant has not been committed in the DT, +// the MDT reader can not filter out the instant correctly, another reason is once the instant is compacted into HFile, the subsequent rollback from DT may try to look up +// the files to be rolled back, an exception could throw(although the default behavior is not to throws). + Review Comment: Let me explain the procedure a little more with a demo: ```java delta_c1 (F3, F4) (MDT) delta_c1 (F1, F2) (DT) c2.inflight (compaction triggers in DT) delta_c3 (F7, F8) (MDT) delta_c3 (F5, F6) (DT) c2 (F7, F8) (compaction complete in MDT) c2 fails to commit to DT delta_c4 (F9, F10) (MDT) -- can we trigger MDT compaction here? The answer is yes 1. c2 in DT would block the archiving of C2 in MDT 2. the MDT reader would ignore the C2 too because it is filtered by the c2 on DT timeline, so the compaction does not include c2 delta_c4 (F11, F12) (DT) r5 (to rollback c2) (MDT) -F7, -F8 r5 (to rollback c2) (DT) ```
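One plausible reading of the filtering rule discussed in this thread can be sketched as follows. This is a toy model under stated assumptions, not the PR's actual code: `Instant` is a stand-in for `HoodieInstant`, the action strings are illustrative, and the predicate reflects the demo's c2 scenario (a pending DT compaction whose instant time already exists as a completed deltacommit in the MDT no longer blocks MDT compaction).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of the instant-filtering rule from the review discussion.
// Instant is a toy stand-in for HoodieInstant; not Hudi's real classes.
public class MdtCompactionGuard {

    static class Instant {
        final String timestamp;
        final String action;
        final boolean completed;

        Instant(String timestamp, String action, boolean completed) {
            this.timestamp = timestamp;
            this.action = action;
            this.completed = completed;
        }
    }

    // Pending DT instants that should still block MDT compaction: everything
    // inflight/requested EXCEPT compactions whose instant time is already a
    // completed deltacommit in the MDT (the demo's c2 case).
    static List<String> blockingInstantTimes(List<Instant> dataTimeline,
                                             Set<String> mdtCompletedDeltaCommits) {
        List<String> blocking = new ArrayList<>();
        for (Instant i : dataTimeline) {
            if (i.completed) {
                continue; // completed instants never block
            }
            boolean compactionAlreadyInMdt = "compaction".equals(i.action)
                && mdtCompletedDeltaCommits.contains(i.timestamp);
            if (!compactionAlreadyInMdt) {
                blocking.add(i.timestamp);
            }
        }
        return blocking;
    }

    public static void main(String[] args) {
        // From the demo: c2 is a pending DT compaction the MDT has already
        // committed; dc4 is an ordinary pending write commit.
        List<Instant> dt = Arrays.asList(
            new Instant("c2", "compaction", false),
            new Instant("dc4", "deltacommit", false));
        Set<String> mdtDone = new HashSet<>(Arrays.asList("dc1", "c2", "dc3"));

        System.out.println(blockingInstantTimes(dt, mdtDone)); // [dc4]
    }
}
```

In this reading, c2 is filtered out (the MDT already has it, and the DT's pending c2 itself blocks archiving of that instant), while the ordinary pending deltacommit dc4 still gates compaction.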
[jira] [Closed] (HUDI-5851) Refactor ExpressionEvaluators to split into 2 phases: evaluator conversion and evaluator execution
[ https://issues.apache.org/jira/browse/HUDI-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5851. Fix Version/s: 0.14.0 Resolution: Fixed Fixed via master branch: 79428391bac7277ffa9e18c75594a6fb9b8c5665 > Refactor ExpressionEvaluators to split into 2 phases: evaluator conversion and > evaluator execution > - > > Key: HUDI-5851 > URL: https://issues.apache.org/jira/browse/HUDI-5851 > Project: Apache Hudi > Issue Type: Sub-task > Components: flink, flink-sql >Reporter: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > >
[GitHub] [hudi] xuzifu666 commented on a diff in pull request #8133: [HUDI-5904] support more than one update actions in merge into table
xuzifu666 commented on code in PR #8133: URL: https://github.com/apache/hudi/pull/8133#discussion_r1132008580 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -115,6 +116,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo }) } + test("Test MergeInto with more than once update actions") { +withRecordType()(withTempDir {tmp => + val conf = new SparkConf().setAppName("insertDatasToHudi").setMaster("local[*]") + val spark = SparkSession.builder().config(conf) +.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") Review Comment: OK, thanks. I added a TODO issue: https://issues.apache.org/jira/browse/HUDI-5918 @XuQianJin-Stars
[jira] [Created] (HUDI-5918) merge into with multiple update actions without preCombine key
xy created HUDI-5918: Summary: merge into with multiple update actions without preCombine key Key: HUDI-5918 URL: https://issues.apache.org/jira/browse/HUDI-5918 Project: Apache Hudi Issue Type: Bug Components: spark-sql Reporter: xy merge into with multiple update actions without preCombine key -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once (#8051)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 79428391bac [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once (#8051) 79428391bac is described below commit 79428391bac7277ffa9e18c75594a6fb9b8c5665 Author: Jing Zhang AuthorDate: Fri Mar 10 14:53:16 2023 +0800 [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once (#8051) * Add log to FileIndex about the data skipping info * Move all evaluators and relative utility in one class --- .../java/org/apache/hudi/source/DataPruner.java| 140 + .../apache/hudi/source/ExpressionEvaluators.java | 576 .../java/org/apache/hudi/source/FileIndex.java | 46 +- .../org/apache/hudi/source/stats/ColumnStats.java | 72 +++ .../hudi/source/stats/ExpressionEvaluator.java | 605 - .../hudi/source/TestExpressionEvaluators.java | 408 ++ .../hudi/source/stats/TestExpressionEvaluator.java | 403 -- .../apache/hudi/table/ITTestHoodieDataSource.java | 7 + 8 files changed, 1230 insertions(+), 1027 deletions(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java new file mode 100644 index 000..605fcdf7fb0 --- /dev/null +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.hudi.source.stats.ColumnStats; +import org.apache.hudi.util.ExpressionUtils; + +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.types.logical.DecimalType; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.flink.table.types.logical.RowType; +import org.apache.flink.table.types.logical.TimestampType; + +import java.io.Serializable; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; + +import static org.apache.hudi.source.ExpressionEvaluators.fromExpression; + +/** + * Utility to do data skipping. + */ +public class DataPruner implements Serializable { + private static final long serialVersionUID = 1L; + + private final String[] referencedCols; + private final List<ExpressionEvaluators.Evaluator> evaluators; + + private DataPruner(String[] referencedCols, List<ExpressionEvaluators.Evaluator> evaluators) { +this.referencedCols = referencedCols; +this.evaluators = evaluators; + } + + /** + * Filters the index row with specific data filters and query fields.
+ * + * @param indexRowThe index row + * @param queryFields The query fields referenced by the filters + * @return true if the index row should be considered as a candidate + */ + public boolean test(RowData indexRow, RowType.RowField[] queryFields) { +Map columnStatsMap = convertColumnStats(indexRow, queryFields); +for (ExpressionEvaluators.Evaluator evaluator : evaluators) { + if (!evaluator.eval(columnStatsMap)) { +return false; + } +} +return true; + } + + public String[] getReferencedCols() { +return referencedCols; + } + + public static DataPruner newInstance(List filters) { +if (filters == null || filters.size() == 0) { + return null; +} +String[] referencedCols = ExpressionUtils.referencedColumns(filters); +if (referencedCols.length == 0) { + return null; +} +List evaluators = fromExpression(filters); +return new DataPruner(referencedCols, evaluators); + } + + public static Map convertColumnStats(RowData indexRow, RowType.RowField[] queryFields) { +if (indexRow == null || queryFields == null) { + throw new IllegalArgumentException("Index Row and query fields could not be null."); +} +Map mapping = new LinkedHashMap<>(); +for (int i = 0; i < queryFields.length; i++) { + String name = queryFields[i].ge
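The core of this change is that the pushed-down filter expressions are converted to evaluators once (via `fromExpression`) and the cached evaluators are then applied to each file's column statistics. The pattern can be sketched in a self-contained way; the names below (`ColumnStats`, `greaterThan`, and so on) are simplified stand-ins for illustration, not the actual Hudi/Flink classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class DataPrunerSketch {

  /** Minimal stand-in for per-column min/max statistics. */
  public static final class ColumnStats {
    public final long min;
    public final long max;
    public ColumnStats(long min, long max) {
      this.min = min;
      this.max = max;
    }
  }

  /** An evaluator decides whether a file may contain matching rows. */
  public interface Evaluator extends Predicate<Map<String, ColumnStats>> {}

  /** "col > literal" can only match rows if the column's max exceeds the literal. */
  public static Evaluator greaterThan(String col, long literal) {
    return stats -> stats.get(col).max > literal;
  }

  /** Evaluators are built once, then applied to every file's stats (conjunction). */
  public static boolean test(List<Evaluator> evaluators, Map<String, ColumnStats> fileStats) {
    for (Evaluator e : evaluators) {
      if (!e.test(fileStats)) {
        return false; // some filter can never match: the file is skipped
      }
    }
    return true; // the file stays a candidate
  }

  public static void main(String[] args) {
    List<Evaluator> evaluators = new ArrayList<>();
    evaluators.add(greaterThan("age", 30)); // converted once, reused per file

    Map<String, ColumnStats> fileA = Collections.singletonMap("age", new ColumnStats(10, 25));
    Map<String, ColumnStats> fileB = Collections.singletonMap("age", new ColumnStats(20, 45));

    if (test(evaluators, fileA)) throw new AssertionError("fileA should be pruned");
    if (!test(evaluators, fileB)) throw new AssertionError("fileB should be kept");
    System.out.println("ok");
  }
}
```

A real evaluator set, as in `ExpressionEvaluators`, covers all comparison and logical operators over typed min/max/null-count statistics; the point here is only the convert-once, evaluate-per-file shape.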
[GitHub] [hudi] danny0405 merged pull request #8051: [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once
danny0405 merged PR #8051: URL: https://github.com/apache/hudi/pull/8051 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
danny0405 commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463348786 > WriteProfiles.getCommitMetadata I see, in this PR https://github.com/apache/hudi/pull/7055 I have moved the utility methods into another class located in `hudi-common`, so this should not be a problem anymore.
[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
danny0405 commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463347507 You are right, the bundle jar should be kept slimmer.
[GitHub] [hudi] danny0405 closed issue #8136: [SUPPORT] Wrong type returned by ParquetColumnarRowSplitReader in hudi-flink1.16.x code
danny0405 closed issue #8136: [SUPPORT] Wrong type returned by ParquetColumnarRowSplitReader in hudi-flink1.16.x code URL: https://github.com/apache/hudi/issues/8136
[jira] [Updated] (HUDI-5917) MOR Table Log file has only one replication
[ https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5917: - Labels: pull-request-available (was: )

> MOR Table Log file has only one replication
> -------------------------------------------
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sandy du
> Priority: Major
> Labels: pull-request-available
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only one replication.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] sandyfog opened a new pull request, #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication
sandyfog opened a new pull request, #8150: URL: https://github.com/apache/hudi/pull/8150

### Change Logs

When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only one replication.

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
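The symptom in HUDI-5917 is a common pitfall of delegating filesystem wrappers: if the wrapper does not forward `getDefaultReplication()` to the wrapped `FileSystem`, callers see the base-class fallback (presumably 1) instead of the cluster's configured replication factor. A minimal, Hadoop-free sketch of this bug class; all class and method names here are illustrative stand-ins, not Hudi's code:

```java
// Illustration of the wrapper-delegation bug class behind HUDI-5917.
// No Hadoop dependency; names are simplified stand-ins, not Hudi classes.
public class RetryWrapperSketch {

  /** Stand-in for an abstract FileSystem with a conservative default. */
  public abstract static class FileSystem {
    public short getDefaultReplication() {
      return 1; // base-class fallback
    }
  }

  /** Stand-in for an HDFS client configured with replication factor 3. */
  public static class DistributedFileSystem extends FileSystem {
    @Override
    public short getDefaultReplication() {
      return 3;
    }
  }

  /** Buggy wrapper: adds retry behavior but forgets to forward the query. */
  public static class BuggyRetryWrapper extends FileSystem {
    final FileSystem delegate;
    public BuggyRetryWrapper(FileSystem delegate) { this.delegate = delegate; }
    // no override -> files are created with replication 1
  }

  /** Fixed wrapper: metadata queries are forwarded to the wrapped filesystem. */
  public static class FixedRetryWrapper extends FileSystem {
    final FileSystem delegate;
    public FixedRetryWrapper(FileSystem delegate) { this.delegate = delegate; }
    @Override
    public short getDefaultReplication() {
      return delegate.getDefaultReplication();
    }
  }

  public static void main(String[] args) {
    FileSystem hdfs = new DistributedFileSystem();
    if (new BuggyRetryWrapper(hdfs).getDefaultReplication() != 1) throw new AssertionError();
    if (new FixedRetryWrapper(hdfs).getDefaultReplication() != 3) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The same reasoning applies to any other metadata query a wrapper inherits instead of forwarding (block size, permissions, and so on): every such method needs an explicit delegating override.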
[jira] [Updated] (HUDI-5917) MOR Table Log file has only one replication
[ https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandy du updated HUDI-5917: Description: When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only one replication. (was: When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only 1 replication.) Summary: MOR Table Log file has only one replication (was: MOR Table Log file have only one replication)

> MOR Table Log file has only one replication
> -------------------------------------------
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sandy du
> Priority: Major
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only one replication.
[jira] [Updated] (HUDI-5917) MOR Table Log file have only one replication
[ https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandy du updated HUDI-5917: Summary: MOR Table Log file have only one replication (was: MOR )

> MOR Table Log file have only one replication
> --------------------------------------------
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sandy du
> Priority: Major
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only 1 replication.
[jira] [Created] (HUDI-5917) MOR
sandy du created HUDI-5917: Summary: MOR Key: HUDI-5917 URL: https://issues.apache.org/jira/browse/HUDI-5917 Project: Apache Hudi Issue Type: Bug Reporter: sandy du

When a MOR table enables HoodieRetryWrapperFileSystem through the configuration `hoodie.filesystem.operation.retry.enable=true`, log files in HDFS have only 1 replication.
[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row
hudi-bot commented on PR #7956: URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463323664

## CI report:

* 6dc701ed6011cb5983de68e88b9a67522d1e8db3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15645)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #8133: [HUDI-5904] support more than one update actions in merge into table
XuQianJin-Stars commented on code in PR #8133: URL: https://github.com/apache/hudi/pull/8133#discussion_r1131979301

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:

```scala
@@ -115,6 +116,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
   })
 }

+  test("Test MergeInto with more than once update actions") {
+    withRecordType()(withTempDir { tmp =>
+      val conf = new SparkConf().setAppName("insertDatasToHudi").setMaster("local[*]")
+      val spark = SparkSession.builder().config(conf)
+        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

Review Comment: Both `conf` and `spark` can be removed; both are already available in the `HoodieSparkSqlTestBase` class.
[GitHub] [hudi] kkrugler commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
kkrugler commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463256084 The `hudi-flink-bundle` pom has what seems like a very long list of transitive dependencies (from running `mvn dependency:tree` in the `packaging/hudi-flink-bundle/` directory). I'm wondering why you don't think this would pull in jars that create conflicts with other jars being used in a workflow...

```
[INFO] org.apache.hudi:hudi-flink1.16-bundle:jar:0.14.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-common:jar:0.14.0-SNAPSHOT:compile
[INFO] |  +- org.openjdk.jol:jol-core:jar:0.16:compile
[INFO] |  +- com.github.ben-manes.caffeine:caffeine:jar:2.9.1:compile
[INFO] |  |  +- org.checkerframework:checker-qual:jar:3.10.0:compile
[INFO] |  |  \- com.google.errorprone:error_prone_annotations:jar:2.5.1:compile
[INFO] |  +- org.apache.httpcomponents:fluent-hc:jar:4.4.1:compile
[INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.4.1:compile
[INFO] |  +- org.apache.hbase:hbase-client:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase.thirdparty:hbase-shaded-protobuf:jar:3.5.1:compile
[INFO] |  |  +- org.apache.hbase:hbase-common:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-logging:jar:2.4.9:compile
[INFO] |  |  |  \- org.apache.hbase.thirdparty:hbase-shaded-gson:jar:3.5.1:compile
[INFO] |  |  +- org.apache.hbase:hbase-hadoop-compat:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-hadoop2-compat:jar:2.4.9:compile
[INFO] |  |  |  \- javax.activation:javax.activation-api:jar:1.2.0:runtime
[INFO] |  |  +- org.apache.hbase:hbase-protocol-shaded:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-protocol:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase.thirdparty:hbase-shaded-miscellaneous:jar:3.5.1:compile
[INFO] |  |  +- org.apache.hbase.thirdparty:hbase-shaded-netty:jar:3.5.1:compile
[INFO] |  |  +- org.apache.htrace:htrace-core4:jar:4.2.0-incubating:compile
[INFO] |  |  +- org.jruby.jcodings:jcodings:jar:1.0.55:compile
[INFO] |  |  +- org.jruby.joni:joni:jar:2.1.31:compile
[INFO] |  |  +- org.apache.commons:commons-crypto:jar:1.0.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-auth:jar:2.10.1:provided
[INFO] |  |     +- com.nimbusds:nimbus-jose-jwt:jar:7.9:provided
[INFO] |  |     |  \- com.github.stephenc.jcip:jcip-annotations:jar:1.0-1:provided
[INFO] |  |     \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:provided
[INFO] |  |        +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:provided
[INFO] |  |        +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:provided
[INFO] |  |        \- org.apache.directory.api:api-util:jar:1.0.0-M20:provided
[INFO] |  +- org.apache.hbase:hbase-server:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-http:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase.thirdparty:hbase-shaded-jetty:jar:3.5.1:compile
[INFO] |  |  |  +- org.apache.hbase.thirdparty:hbase-shaded-jersey:jar:3.5.1:compile
[INFO] |  |  |  |  +- jakarta.ws.rs:jakarta.ws.rs-api:jar:2.1.6:compile
[INFO] |  |  |  |  +- jakarta.annotation:jakarta.annotation-api:jar:1.3.5:compile
[INFO] |  |  |  |  +- jakarta.validation:jakarta.validation-api:jar:2.0.2:compile
[INFO] |  |  |  |  \- org.glassfish.hk2.external:jakarta.inject:jar:2.6.1:compile
[INFO] |  |  |  \- javax.ws.rs:javax.ws.rs-api:jar:2.1.1:compile
[INFO] |  |  +- org.apache.hbase:hbase-procedure:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-zookeeper:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-replication:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-metrics-api:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-metrics:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-asyncfs:jar:2.4.9:compile
[INFO] |  |  +- org.glassfish.web:javax.servlet.jsp:jar:2.3.2:compile
[INFO] |  |  |  \- org.glassfish:javax.el:jar:3.0.1-b12:provided
[INFO] |  |  +- javax.servlet.jsp:javax.servlet.jsp-api:jar:2.3.1:compile
[INFO] |  |  +- org.apache.commons:commons-math3:jar:3.6.1:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.10.0:compile
[INFO] |  +- commons-io:commons-io:jar:2.11.0:compile
[INFO] |  +- org.lz4:lz4-java:jar:1.8.0:compile
[INFO] |  \- com.lmax:disruptor:jar:3.4.2:compile
[INFO] +- org.apache.hudi:hudi-client-common:jar:0.14.0-SNAPSHOT:compile
[INFO] |  +- com.github.davidmoten:hilbert-curve:jar:0.2.2:compile
[INFO] |  |  \- com.github.davidmoten:guava-mini:jar:0.1.3:compile
[INFO] |  +- io.dropwizard.metrics:metrics-graphite:jar:4.1.1:compile
[INFO] |  +- io.dropwizard.metrics:metrics-core:jar:4.1.1:compile
[INFO] |  +- io.dropwizard.metrics:metrics-jmx:jar:4.1.1:compile
[IN
```
[GitHub] [hudi] kkrugler commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
kkrugler commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463250189 I was also confused by that. I think when the `HoodieInputFormatUtils` class is loaded via the call from `WriteProfiles.getCommitMetadata()` to `HoodieInputFormatUtils.getCommitMetadata()`, this indirectly triggers a reference to `MapredParquetInputFormat` (e.g. maybe through a static class reference?).
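The "static class reference" theory is plausible JVM behavior: merely declaring a field or parameter of some type does not initialize that class, but a static initializer (including static field initialization) that touches it does, so the first active use of a utility class can fail with `NoClassDefFoundError` for a type the executed method never mentions. A self-contained sketch of this mechanism; the names are hypothetical stand-ins, not Hudi's actual classes:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of how a static reference drags in a class that the executed
// code path never uses. Names are hypothetical, not Hudi's classes.
public class StaticInitSketch {

  /** Records which classes have been initialized. */
  public static final Set<String> INITIALIZED = new HashSet<>();

  /** Plays the role of MapredParquetInputFormat. */
  public static class HeavyFormat {
    static {
      INITIALIZED.add("HeavyFormat"); // runs on class initialization
    }
  }

  /** Plays the role of HoodieInputFormatUtils. */
  public static class Utils {
    // Initializing Utils instantiates HeavyFormat, initializing it too --
    // even for callers who only wanted an unrelated method below.
    static final HeavyFormat DEFAULT_FORMAT = new HeavyFormat();

    public static String commitMetadata() {
      return "metadata"; // never mentions HeavyFormat
    }
  }

  public static void main(String[] args) {
    if (INITIALIZED.contains("HeavyFormat")) throw new AssertionError("loaded too early");
    // First active use of Utils initializes it, which initializes HeavyFormat.
    String m = Utils.commitMetadata();
    if (!INITIALIZED.contains("HeavyFormat")) throw new AssertionError("static init did not run");
    System.out.println(m);
  }
}
```

If `HeavyFormat` were missing from the classpath at runtime, the first call to `Utils.commitMetadata()` would fail with `NoClassDefFoundError`, even though that method never touches the class: exactly the symptom reported in this issue.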
[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
danny0405 commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463239717 The error stack trace confused me a lot, because `WriteProfiles.getCommitMetadata` does not depend on `MapredParquetInputFormat` in the code path, so why does it try to load it?
[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio
hudi-bot commented on PR #8149: URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463234399

## CI report:

* b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650)
[GitHub] [hudi] hudi-bot commented on pull request #8139: [HUDI-5909] Reuse hive client if possible
hudi-bot commented on PR #8139: URL: https://github.com/apache/hudi/pull/8139#issuecomment-1463234376

## CI report:

* 0bcd6490f856475266dfff3882728aa1392727f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15628)
* 075563866d156e36afe34780d5fb132d6da57251 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15649)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1463234314

## CI report:

* 35aed635391309c3c6c4b3794044bba53b3468ef Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15603)
* 9dfbe3e6135456e7f8c79513270eb5e7e4ed123d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15648)
[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio
hudi-bot commented on PR #8149: URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463230874

## CI report:

* b04749aba0c507eb67fd6dd756e21ed7f1e3535e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8139: [HUDI-5909] Reuse hive client if possible
hudi-bot commented on PR #8139: URL: https://github.com/apache/hudi/pull/8139#issuecomment-1463230840

## CI report:

* 0bcd6490f856475266dfff3882728aa1392727f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15628)
* 075563866d156e36afe34780d5fb132d6da57251 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1463230758

## CI report:

* 35aed635391309c3c6c4b3794044bba53b3468ef Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15603)
* 9dfbe3e6135456e7f8c79513270eb5e7e4ed123d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463226268

## CI report:

* 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
* 0268541001db5b561328bdf9390ee2cb5e92 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
* a690c5122694914f975ebbb717e06630ac3b5902 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646)
[jira] [Updated] (HUDI-5915) listStatus error caused by minio storage
[ https://issues.apache.org/jira/browse/HUDI-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5915: - Labels: pull-request-available (was: )

> listStatus error caused by minio storage
> ----------------------------------------
>
> Key: HUDI-5915
> URL: https://issues.apache.org/jira/browse/HUDI-5915
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: linfey.nie
> Assignee: linfey.nie
> Priority: Major
> Labels: pull-request-available
>
> When the storage is MinIO, an empty folder is treated as nonexistent, so listStatus throws an error and the whole program fails.
[GitHub] [hudi] linfey90 opened a new pull request, #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio
linfey90 opened a new pull request, #8149: URL: https://github.com/apache/hudi/pull/8149

### Change Logs

When the storage is MinIO, an empty folder is treated as nonexistent, so listStatus throws an error and the whole program fails. When we created the table, ckp_meta was empty, causing an error on the later listStatus call (e.g. on insert). This PR fixes that.

### Impact

no

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
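The defensive pattern described in this PR (treat a missing or empty checkpoint-metadata directory as "no messages" instead of letting the listing call fail) can be sketched with the JDK filesystem API. This is only an illustration of the pattern under assumed names (`listOrEmpty` is not Hudi's method); object stores like MinIO have no real directories, so a prefix holding no objects can legitimately "not exist":

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SafeListSketch {

  /**
   * Lists a directory, treating a missing directory as empty instead of
   * failing, mirroring the guard this PR adds around listStatus.
   */
  public static List<Path> listOrEmpty(Path dir) {
    if (!Files.isDirectory(dir)) {
      return Collections.emptyList(); // missing ckp-meta dir == no messages
    }
    try (Stream<Path> entries = Files.list(dir)) {
      return entries.sorted().collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path missing = Paths.get("definitely-missing-ckp-meta-dir");
    // No exception: a nonexistent directory simply yields an empty listing.
    System.out.println(listOrEmpty(missing).size()); // prints 0
  }
}
```

An alternative fix, also seen with object stores, is to eagerly create the marker directory at table-creation time so later listings always find it; either way the caller must not assume the path exists.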
[jira] [Created] (HUDI-5916) flink bundle jar includes the hive-exec core by default
Danny Chen created HUDI-5916: Summary: flink bundle jar includes the hive-exec core by default Key: HUDI-5916 URL: https://issues.apache.org/jira/browse/HUDI-5916 Project: Apache Hudi Issue Type: Improvement Components: dependencies Reporter: Danny Chen Fix For: 0.13.1, 0.14.0
[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)
danny0405 commented on issue #8147: URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463211902 On a cluster you should use the bundle jar instead, and yeah, the default bundle jar does not package the hive-exec, which should be fixed: https://issues.apache.org/jira/browse/HUDI-5916 The `hudi-flink` pom already includes the `hive-exec` dependency: https://github.com/apache/hudi/blob/2675118d95c7a087cd9222a05cd7376eb0a31aad/hudi-flink-datasource/hudi-flink/pom.xml#L287, but it is not packaged into the released jar; that is by design, we only introduce the hive jars into the bundle jars.
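A fix in the direction of HUDI-5916 would add the hive-exec artifact to the bundle module's maven-shade-plugin includes. The fragment below is a hypothetical sketch of that change, not the merged patch: the `${hive.groupid}` property, profile wiring, and any relocations are assumptions.

```xml
<!-- Hypothetical sketch for HUDI-5916: include hive-exec in the flink
     bundle's shaded jar. Coordinates and wiring are assumptions. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <!-- existing includes ... -->
        <include>${hive.groupid}:hive-exec</include>
      </includes>
    </artifactSet>
  </configuration>
</plugin>
```

The trade-off discussed in this thread applies: bundling hive-exec avoids `NoClassDefFoundError` out of the box, but grows the jar and raises the chance of classpath conflicts, which is why keeping the bundle slim was raised as a counterpoint.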
[jira] [Assigned] (HUDI-5915) listStatus error caused by minio storage
[ https://issues.apache.org/jira/browse/HUDI-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linfey.nie reassigned HUDI-5915: Assignee: linfey.nie

> listStatus error caused by minio storage
> ----------------------------------------
>
> Key: HUDI-5915
> URL: https://issues.apache.org/jira/browse/HUDI-5915
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: linfey.nie
> Assignee: linfey.nie
> Priority: Major
>
> When the storage is MinIO, an empty folder is treated as nonexistent, so listStatus throws an error and the whole program fails.
[jira] [Created] (HUDI-5915) listStatus error caused by minio storage
linfey.nie created HUDI-5915: Summary: listStatus error caused by minio storage Key: HUDI-5915 URL: https://issues.apache.org/jira/browse/HUDI-5915 Project: Apache Hudi Issue Type: Bug Reporter: linfey.nie

When the storage is MinIO, an empty folder is treated as nonexistent, so listStatus throws an error and the whole program fails.
[jira] [Closed] (HUDI-5914) Fix for RowData class cast exception
[ https://issues.apache.org/jira/browse/HUDI-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5914. Resolution: Fixed Fixed via master branch: 2675118d95c7a087cd9222a05cd7376eb0a31aad

> Fix for RowData class cast exception
> ------------------------------------
>
> Key: HUDI-5914
> URL: https://issues.apache.org/jira/browse/HUDI-5914
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Danny Chen
> Priority: Major
> Fix For: 0.13.1, 0.14.0
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463199917

## CI report:

* 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
* 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639)
* 0268541001db5b561328bdf9390ee2cb5e92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
* a690c5122694914f975ebbb717e06630ac3b5902 UNKNOWN
[hudi] branch master updated (bab75b6c60c -> 2675118d95c)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

 from bab75b6c60c [HUDI-4911] Following the first patch, fix the inefficient code (#8127)
  add 2675118d95c [HUDI-5941] Fix for RowData class cast exception (#8145)

No new revisions were added by this update.

Summary of changes:
 .../table/format/cow/vector/reader/ParquetColumnarRowSplitReader.java | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
[GitHub] [hudi] danny0405 merged pull request #8145: [HUDI-5941] Fix for RowData class cast exception
danny0405 merged PR #8145: URL: https://github.com/apache/hudi/pull/8145
[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row
hudi-bot commented on PR #7956: URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463195659

## CI report:

* 5bd4d5c4de8fc54bf93fb7fd252b6e61fda85373 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15194) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15233) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15247) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15641)
* 6dc701ed6011cb5983de68e88b9a67522d1e8db3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15645)
[GitHub] [hudi] danny0405 commented on pull request #8145: [HUDI-5941] Fix for RowData class cast exception
danny0405 commented on PR #8145: URL: https://github.com/apache/hudi/pull/8145#issuecomment-1463195670 The test failure: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15642&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a&l=682866 is not caused by the change, so I would merge it soon.
[jira] [Created] (HUDI-5914) Fix for RowData class cast exception
Danny Chen created HUDI-5914: Summary: Fix for RowData class cast exception Key: HUDI-5914 URL: https://issues.apache.org/jira/browse/HUDI-5914 Project: Apache Hudi Issue Type: Bug Components: writer-core Reporter: Danny Chen Fix For: 0.13.1, 0.14.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row
hudi-bot commented on PR #7956: URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463191824 ## CI report: * 5bd4d5c4de8fc54bf93fb7fd252b6e61fda85373 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15194) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15233) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15247) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15641) * 6dc701ed6011cb5983de68e88b9a67522d1e8db3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463187972 ## CI report: * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN * 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639) * 0268541001db5b561328bdf9390ee2cb5e92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
[GitHub] [hudi] xuzifu666 commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
xuzifu666 commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463177403
> This will cause data quality problems if we only remove the check. If the source table has no precombineField, it looks like Hudi will add the first updateAction's assignment value expression (whose key is the target precombineField) to the source df, because we need to dedup before using the payload. And we need to add more tests (cow/mor):
> * different updateActions with different precombine field exprs
> * source table without precombineField (like target.precombineField = source.otherfield)
> a simple test like this:
>
> ```sql
> merge into $cowTableName t0
> using (
>   select 1 as id, 'a1_n_6' as name, 6 as price, 1010 as v_ts, '1' as flag union
>   select 2 as id, 'a2_n_6' as name, 6 as price, 1010 as v_ts, '2' as flag union
>   select 6 as id, 'a3_n_6' as name, 6 as price, 1010 as v_ts, '1' as flag
> ) s0
> on s0.id = t0.id
> when matched and flag = '1' then update set
>   id = s0.id, name = s0.name, ts = 1003
> when matched and flag = '2' then update set
>   id = s0.id, price = s0.price, ts = s0.v_ts + 2
> when not matched and flag = '1' then insert *
> ```

Yes, but mostly the business upserts only one record; I thought this does not impact the business for single-record upserts.
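The dedup the reviewer is worried about can be sketched outside of Hudi: when the MERGE INTO source contains several rows for the same key, a single winner per key must be chosen by the precombine value before any payload is applied. This is a hypothetical, minimal model (record shape and names are illustrative, not Hudi's actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of precombine-based dedup: for each record key, keep
// only the row with the largest precombine value. Real Hudi dedups full
// records via the payload class; here a row is just (key, precombine value).
public class Precombine {
    public static Map<String, Long> dedup(List<Map.Entry<String, Long>> rows) {
        Map<String, Long> winners = new HashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            // merge keeps the maximum precombine value seen for this key
            winners.merge(row.getKey(), row.getValue(), Long::max);
        }
        return winners;
    }
}
```

With more than one update action, each action may assign a different precombine expression, which is why a single-pass dedup like this is no longer well defined — the point of the reviewer's concern.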
[GitHub] [hudi] zhuoluoy commented on issue #7417: [SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release
zhuoluoy commented on issue #7417: URL: https://github.com/apache/hudi/issues/7417#issuecomment-1463139378 Should we open an Apache JIRA for this?
[GitHub] [hudi] zhuoluoy commented on issue #7417: [SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release
zhuoluoy commented on issue #7417: URL: https://github.com/apache/hudi/issues/7417#issuecomment-1463137341 Actually, for legacy MapReduce, this patch is very important. Without this patch, HoodieROTablePathFilter will be thousands of times slower. Can we just bring back https://github.com/apache/hudi/pull/3719 and fix the NPE?
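The speedup the commenter describes comes from memoizing the expensive table-level file-system view instead of re-listing the table for every path the filter is asked about. A minimal, hypothetical sketch of that caching idea (the class, the loader function, and `accept` are illustrative, not Hudi's actual API):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the caching behind the patch: compute the costly
// "latest base files" listing at most once per table base path and reuse it
// for every subsequent accept() call, instead of issuing list/get requests
// for each individual file the PathFilter examines.
public class CachedTableView {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader; // the expensive listing

    public CachedTableView(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    public boolean accept(String basePath, String fileName) {
        // computeIfAbsent runs the loader only on the first miss per base path
        return cache.computeIfAbsent(basePath, loader).contains(fileName);
    }
}
```

For a MapReduce job that probes thousands of files under one table, this turns thousands of listings into one, which matches the "thousands of times slower" observation above.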
[GitHub] [hudi] zhuoluoy commented on pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests
zhuoluoy commented on PR #3719: URL: https://github.com/apache/hudi/pull/3719#issuecomment-1463135838 Actually, for legacy MapReduce, this patch is very important. Without this patch, HoodieROTablePathFilter will be thousands of times slower.
[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table
hudi-bot commented on PR #8133: URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463129575 ## CI report: * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN * 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639) * 0268541001db5b561328bdf9390ee2cb5e92 UNKNOWN
[jira] [Updated] (HUDI-1243) Debug test-suite docker execution
[ https://issues.apache.org/jira/browse/HUDI-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1243: - Issue Type: Task (was: Bug) > Debug test-suite docker execution > > Key: HUDI-1243 > URL: https://issues.apache.org/jira/browse/HUDI-1243 > Project: Apache Hudi > Issue Type: Task > Components: Testing, tests-ci > Affects Versions: 0.8.0 > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Minor > > Debug and fix test-suite docker execution. We should have a smooth run where the end-to-end COW and MOR test suites run w/o any issues in our local dev box (laptop)
[jira] [Updated] (HUDI-1243) Debug test-suite docker execution
[ https://issues.apache.org/jira/browse/HUDI-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1243: - Fix Version/s: (was: 0.13.1)
[jira] [Updated] (HUDI-5824) COMBINE_BEFORE_UPSERT=false option does not work for upsert
[ https://issues.apache.org/jira/browse/HUDI-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5824: - Priority: Critical (was: Minor) > COMBINE_BEFORE_UPSERT=false option does not work for upsert > > Key: HUDI-5824 > URL: https://issues.apache.org/jira/browse/HUDI-5824 > Project: Apache Hudi > Issue Type: Bug > Components: spark > Affects Versions: 0.12.1, 0.12.2, 0.13.0 > Reporter: kazdy > Assignee: kazdy > Priority: Critical > Labels: pull-request-available > Fix For: 0.13.1, 0.12.3 > > hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala > shouldCombine does not take into account the situation where the write operation is UPSERT but COMBINE_BEFORE_UPSERT is false. > Currently, Hudi always combines records on UPSERT, and option COMBINE_BEFORE_UPSERT is not honored.
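The fix the ticket asks for amounts to gating the combine step on the configured flag rather than on the operation alone. A hedged sketch of that gating (operation strings, flag names, and the method itself are illustrative; this is not the actual `HoodieSparkSqlWriter.shouldCombine` code):

```java
// Hypothetical sketch: shouldCombine must consult COMBINE_BEFORE_UPSERT for
// UPSERT rather than combining unconditionally, which is the bug described.
// A precombine field is also required, since combining compares by it.
public class CombineGate {
    public static boolean shouldCombine(String operation,
                                        boolean combineBeforeUpsert,
                                        boolean combineBeforeInsert,
                                        boolean hasPrecombineField) {
        switch (operation) {
            case "upsert":
                // honor the flag instead of always returning true for upsert
                return combineBeforeUpsert && hasPrecombineField;
            case "insert":
            case "bulk_insert":
                return combineBeforeInsert && hasPrecombineField;
            default:
                return false;
        }
    }
}
```

Under this sketch, `upsert` with `COMBINE_BEFORE_UPSERT=false` skips the combine, which is exactly the behavior the reporter expected.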
[jira] [Updated] (HUDI-4733) Flag emitDelete is inconsistent in HoodieTableSource and MergeOnReadInputFormat
[ https://issues.apache.org/jira/browse/HUDI-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4733: - Fix Version/s: 0.14.0 (was: 0.13.1) > Flag emitDelete is inconsistent in HoodieTableSource and MergeOnReadInputFormat > > Key: HUDI-4733 > URL: https://issues.apache.org/jira/browse/HUDI-4733 > Project: Apache Hudi > Issue Type: Bug > Components: flink, flink-sql > Reporter: nonggia.liang > Assignee: Zhaojing Yu > Priority: Minor > Fix For: 0.14.0 > > Attachments: image 1.png > > When reading a MOR table in Flink, we encountered an exception from the Flink runtime (as shown in image 1), which complained that the table source should not emit a retract record. > !image 1.png! > I think here is the cause, in HoodieTableSource: > {code:java}
> @Override
> public ChangelogMode getChangelogMode() {
>   // when read as streaming and changelog mode is enabled, emit as FULL mode;
>   // when all the changes are compacted or read as batch, emit as INSERT mode.
>   return OptionsResolver.emitChangelog(conf) ? ChangelogModes.FULL : ChangelogMode.insertOnly();
> } {code}
> {code:java}
> private InputFormat getStreamInputFormat() {
>   ...
>   if (FlinkOptions.QUERY_TYPE_SNAPSHOT.equals(queryType)) {
>     final HoodieTableType tableType = HoodieTableType.valueOf(this.conf.getString(FlinkOptions.TABLE_TYPE));
>     boolean emitDelete = tableType == HoodieTableType.MERGE_ON_READ;
>     return mergeOnReadInputFormat(rowType, requiredRowType, tableAvroSchema, rowDataType, Collections.emptyList(), emitDelete);
>   }
>   ...
> }
> {code}
> With these options: > {{'table.type' = 'MERGE_ON_READ'}} > {{'read.streaming.enabled' = 'true'}} > the HoodieTableSource announces it has only an INSERT changelog, but MergeOnReadInputFormat will emit deletes.
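The consistency the ticket asks for can be sketched as a single decision that both the announced `ChangelogMode` and the input format's `emitDelete` flag derive from. This is a hypothetical model with illustrative names, not Hudi's actual API:

```java
// Hypothetical sketch: make "may this source emit retractions/deletes?" a
// single predicate, so the table source never announces insert-only mode
// while its input format emits deletes (the mismatch the ticket describes).
public class ChangelogGate {
    // Mirrors OptionsResolver.emitChangelog loosely: FULL changelog only when
    // reading as a stream with changelog mode enabled.
    public static boolean emitsChangelog(boolean streamingRead, boolean changelogEnabled) {
        return streamingRead && changelogEnabled;
    }

    // emitDelete must not depend on table type alone; it must also respect
    // the announced changelog mode.
    public static boolean emitDelete(boolean streamingRead, boolean changelogEnabled, boolean isMergeOnRead) {
        return isMergeOnRead && emitsChangelog(streamingRead, changelogEnabled);
    }
}
```

In the buggy code above, `emitDelete` was `true` for any MOR snapshot read, even when `getChangelogMode()` had promised `insertOnly()` — the sketch ties the two together.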
[jira] [Updated] (HUDI-3616) Investigate mor async compact integ test failure
[ https://issues.apache.org/jira/browse/HUDI-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3616: - Fix Version/s: 0.14.0 (was: 0.13.1) > Investigate mor async compact integ test failure > > Key: HUDI-3616 > URL: https://issues.apache.org/jira/browse/HUDI-3616 > Project: Apache Hudi > Issue Type: Bug > Components: tests-ci > Reporter: sivabalan narayanan > Priority: Minor > Fix For: 0.14.0 > > mor async compact integ test validation is failing. > > {code:java}
> 22/03/14 01:31:28 WARN DagNode: Validation using data from input path /home/hadoop/staging/input//*/*
> 22/03/14 01:31:28 INFO ValidateDatasetNode: Validate data in target hudi path /home/hadoop/staging/output//*/*/*
> 22/03/14 01:31:31 ERROR DagNode: Data set validation failed. Total count in hudi 64400, input df count 64400
> 22/03/14 01:31:31 INFO DagScheduler: Forcing shutdown of executor service, this might kill running tasks
> 22/03/14 01:31:31 ERROR HoodieTestSuiteJob: Failed to run Test Suite
> java.util.concurrent.ExecutionException: java.lang.AssertionError: Hudi contents does not match contents input data.
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
> at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
> at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
> at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.AssertionError: Hudi contents does not match contents input data.
> at org.apache.hudi.integ.testsuite.dag.nodes.BaseValidateDatasetNode.execute(BaseValidateDatasetNode.java:119)
> at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
> at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Failed to run Test Suite
> at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208)
> at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at org.apache.spark.deploy.SparkSubmi
[jira] [Updated] (HUDI-2954) Code cleanup: HFileDataBlock - using integer keys is never used
[ https://issues.apache.org/jira/browse/HUDI-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2954: - Fix Version/s: 0.14.0 (was: 0.13.1) > Code cleanup: HFileDataBlock - using integer keys is never used > > Key: HUDI-2954 > URL: https://issues.apache.org/jira/browse/HUDI-2954 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, metadata > Reporter: Manoj Govindassamy > Assignee: Ethan Guo > Priority: Minor > Fix For: 0.14.0 > > KeyField can never be empty for HFile. If so, there is really no need for falling back to sequential integer keys in the HFileDataBlock::serializeRecords() code path. > {noformat}
> // Build the record key
> final Field schemaKeyField = records.get(0).getSchema().getField(this.keyField);
> if (schemaKeyField == null) {
>   // Missing key metadata field. Use an integer sequence key instead.
>   useIntegerKey = true;
>   keySize = (int) Math.ceil(Math.log(records.size())) + 1;
> }
> while (itr.hasNext()) {
>   IndexedRecord record = itr.next();
>   String recordKey;
>   if (useIntegerKey) {
>     recordKey = String.format("%" + keySize + "s", key++);
>   } else {
>     recordKey = record.get(schemaKeyField.pos()).toString();
>   }
> {noformat}
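With the key field guaranteed present, the dead fallback branch (`useIntegerKey` / `keySize` / `key++`) can be deleted and key extraction collapses to a single lookup. A hedged sketch of the simplified path — records are modeled as `Map`s purely for illustration, whereas the real code works on Avro `IndexedRecord`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of the cleanup: extract each record's key directly,
// failing fast if the supposedly-always-present key field is missing,
// instead of silently generating an integer-sequence key.
public class RecordKeys {
    public static List<String> extract(List<Map<String, Object>> records, String keyField) {
        List<String> keys = new ArrayList<>();
        for (Map<String, Object> record : records) {
            Object key = Objects.requireNonNull(record.get(keyField),
                "key field '" + keyField + "' must be present in every record");
            keys.add(key.toString());
        }
        return keys;
    }
}
```

Failing fast here is arguably safer than the old fallback, which would quietly write wrong keys if the invariant were ever violated.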
[jira] [Updated] (HUDI-5824) COMBINE_BEFORE_UPSERT=false option does not work for upsert
[ https://issues.apache.org/jira/browse/HUDI-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5824: - Fix Version/s: 0.12.3
[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column
[ https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3646: - Priority: Critical (was: Minor) > The Hudi update syntax should not modify the nullability attribute of a column > > Key: HUDI-3646 > URL: https://issues.apache.org/jira/browse/HUDI-3646 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Affects Versions: 0.10.1 > Environment: spark3.1.2 > Reporter: Tao Meng > Assignee: Alexey Kudinkin > Priority: Critical > Fix For: 0.13.1 > > Now, when we use Spark SQL to update a Hudi table, we find that Hudi changes the nullability attribute of a column, e.g.: > {code:java}
> val tableName = generateTableName
> val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
> // create table
> spark.sql(
>   s"""
>      |create table $tableName (
>      |  id int,
>      |  name string,
>      |  price double,
>      |  ts long
>      |) using hudi
>      | location '$tablePath'
>      | options (
>      |  type = '$tableType',
>      |  primaryKey = 'id',
>      |  preCombineField = 'ts'
>      | )
>      """.stripMargin)
> // insert data to table
> spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
> spark.sql(s"select * from $tableName").printSchema()
> // update data
> spark.sql(s"update $tableName set price = 20 where id = 1")
> spark.sql(s"select * from $tableName").printSchema() {code}
> |-- _hoodie_commit_time: string (nullable = true)
> |-- _hoodie_commit_seqno: string (nullable = true)
> |-- _hoodie_record_key: string (nullable = true)
> |-- _hoodie_partition_path: string (nullable = true)
> |-- _hoodie_file_name: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- name: string (nullable = true)
> *|-- price: double (nullable = true)*
> |-- ts: long (nullable = true)
>
> |-- _hoodie_commit_time: string (nullable = true)
> |-- _hoodie_commit_seqno: string (nullable = true)
> |-- _hoodie_record_key: string (nullable = true)
> |-- _hoodie_partition_path: string (nullable = true)
> |-- _hoodie_file_name: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- name: string (nullable = true)
> *|-- price: double (nullable = false)*
> |-- ts: long (nullable = true)
>
> The nullable attribute of price has been changed to false; this is not the result we want.
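The invariant the ticket asks for can be stated as: on UPDATE, a column's nullability comes from the target table's declared schema, never from the incoming batch (whose literals, like `set price = 20`, are provably non-null and would otherwise tighten the column). A hypothetical, minimal sketch of that reconciliation rule — names are illustrative, Hudi's real schema handling lives in its Spark/Avro code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: resolve the written schema's nullability per column
// from the target table, deliberately ignoring the batch's inferred
// nullability so an update of some rows cannot flip nullable=true to false.
public class NullabilityMerge {
    public static Map<String, Boolean> reconcile(Map<String, Boolean> targetNullable,
                                                 Map<String, Boolean> incomingNullable) {
        Map<String, Boolean> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> column : targetNullable.entrySet()) {
            // the table's declared nullability wins unconditionally
            resolved.put(column.getKey(), column.getValue());
        }
        return resolved;
    }
}
```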
[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column
[ https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3646: - Fix Version/s: 0.12.3
[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging
[ https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5292: - Priority: Major (was: Critical) > Exclude the test resources from every module packaging > > Key: HUDI-5292 > URL: https://issues.apache.org/jira/browse/HUDI-5292 > Project: Apache Hudi > Issue Type: Improvement > Components: dependencies > Reporter: Sagar Sumit > Priority: Major > Fix For: 0.13.1, 0.12.3 > > Exclude the test resources, especially the properties files that conflict with user-provided resources, from every module. This is a followup to https://github.com/apache/hudi/pull/7310#issuecomment-1328728297
[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging
[ https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5292: - Fix Version/s: 0.12.3
[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging
[ https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5292: - Priority: Critical (was: Major)
[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging
[ https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5292: - Component/s: dependencies
[jira] [Updated] (HUDI-5037) Upgrade libthrift in integ-test-bundle
[ https://issues.apache.org/jira/browse/HUDI-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5037: - Fix Version/s: 0.13.0 (was: 0.13.1) > Upgrade libthrift in integ-test-bundle > > Key: HUDI-5037 > URL: https://issues.apache.org/jira/browse/HUDI-5037 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Ethan Guo > Priority: Major > Labels: pull-request-available > Fix For: 0.13.0
[jira] [Closed] (HUDI-5037) Upgrade libthrift in integ-test-bundle
[ https://issues.apache.org/jira/browse/HUDI-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-5037. Resolution: Fixed
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131853630 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.avro.generic.GenericRecord +import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS +import org.apache.hudi.common.config.HoodieConfig +import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType} +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.exception.HoodieException +import org.apache.spark.TaskContext + +object AutoRecordKeyGenerationUtils { + + // supported operation types when auto generation of record keys is enabled. + val supportedOperations: Set[String] = +Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, WriteOperationType.DELETE, Review Comment: nope. its feasible via spark-sql. will tackle this in phase 2 -- This is an automated message from the Apache Git Service. 
[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests
[ https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4557: - Fix Version/s: 0.12.3 > Support validation of column stats of avro log files in tests > > Key: HUDI-4557 > URL: https://issues.apache.org/jira/browse/HUDI-4557 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci > Reporter: Ethan Guo > Priority: Critical > Fix For: 0.13.1, 0.12.3 > > In TestColumnStatsIndex, when comparing the column stats with the actual data files, only parquet files are supported. We need to support avro log files as well. Note that, to validate the column stats of avro log files, we use resource files storing the expected column stat table content for validation.
[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests
[ https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4557: - Priority: Critical (was: Major) > Support validation of column stats of avro log files in tests > - > > Key: HUDI-4557 > URL: https://issues.apache.org/jira/browse/HUDI-4557 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: Ethan Guo >Priority: Critical > Fix For: 0.13.1 > > > In TestColumnStatsIndex, when comparing the column stats with the actual data > files, only parquet files are supported. We need to support avro log files > as well. Note that, to validate the column stat of avro log files, we use > resource files storing the expected column stat table content for validation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests
[ https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4557: - Issue Type: Test (was: Improvement) > Support validation of column stats of avro log files in tests > - > > Key: HUDI-4557 > URL: https://issues.apache.org/jira/browse/HUDI-4557 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: Ethan Guo >Priority: Major > Fix For: 0.13.1 > > > In TestColumnStatsIndex, when comparing the column stats with the actual data > files, only parquet files are supported. We need to support avro log files > as well. Note that, to validate the column stat of avro log files, we use > resource files storing the expected column stat table content for validation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming
[ https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2782: - Issue Type: Bug (was: Improvement) > Fix marker based strategy for structured streaming > -- > > Key: HUDI-2782 > URL: https://issues.apache.org/jira/browse/HUDI-2782 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Priority: Major > Fix For: 0.13.1 > > > As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are > making timeline server based as the default marker type. But we have an issue > w/ structured streaming. Looks like after 1st micro batch, the timeline > server gets shutdown and for subsequent micro batches, timeline server is not > available. So, in the patch we have made marker based overridden just for > structured streaming. > > We may want to revisit this and see how to go about it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming
[ https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2782: - Fix Version/s: 0.12.3 > Fix marker based strategy for structured streaming > -- > > Key: HUDI-2782 > URL: https://issues.apache.org/jira/browse/HUDI-2782 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Priority: Major > Fix For: 0.13.1, 0.12.3 > > > As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are > making timeline server based as the default marker type. But we have an issue > w/ structured streaming. Looks like after 1st micro batch, the timeline > server gets shutdown and for subsequent micro batches, timeline server is not > available. So, in the patch we have made marker based overridden just for > structured streaming. > > We may want to revisit this and see how to go about it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2506) Hudi dependency governance
[ https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2506: - Fix Version/s: 0.14.0 > Hudi dependency governance > -- > > Key: HUDI-2506 > URL: https://issues.apache.org/jira/browse/HUDI-2506 > Project: Apache Hudi > Issue Type: Test > Components: dependencies, Usability >Reporter: vinoyang >Assignee: Lokesh Jain >Priority: Critical > Fix For: 0.13.1, 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming
[ https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2782: - Priority: Critical (was: Major) > Fix marker based strategy for structured streaming > -- > > Key: HUDI-2782 > URL: https://issues.apache.org/jira/browse/HUDI-2782 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.13.1, 0.12.3 > > > As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are > making timeline server based as the default marker type. But we have an issue > w/ structured streaming. Looks like after 1st micro batch, the timeline > server gets shutdown and for subsequent micro batches, timeline server is not > available. So, in the patch we have made marker based overridden just for > structured streaming. > > We may want to revisit this and see how to go about it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2506) Hudi dependency governance
[ https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2506: - Issue Type: Test (was: Improvement) > Hudi dependency governance > -- > > Key: HUDI-2506 > URL: https://issues.apache.org/jira/browse/HUDI-2506 > Project: Apache Hudi > Issue Type: Test > Components: dependencies, Usability >Reporter: vinoyang >Assignee: Lokesh Jain >Priority: Major > Fix For: 0.13.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2506) Hudi dependency governance
[ https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2506: - Priority: Critical (was: Major) > Hudi dependency governance > -- > > Key: HUDI-2506 > URL: https://issues.apache.org/jira/browse/HUDI-2506 > Project: Apache Hudi > Issue Type: Test > Components: dependencies, Usability >Reporter: vinoyang >Assignee: Lokesh Jain >Priority: Critical > Fix For: 0.13.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5721) Add Github actions on more validations
[ https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5721: - Priority: Blocker (was: Critical) > Add Github actions on more validations > -- > > Key: HUDI-5721 > URL: https://issues.apache.org/jira/browse/HUDI-5721 > Project: Apache Hudi > Issue Type: Test >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.1, 0.12.3 > > > Add the following validation from source release validation to Github actions: > * Binary files should not be present > * DISCLAIMER file should not be present > * LICENSE and NOTICE should exist > * Licensing check > * RAT check -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5721) Add Github actions on more validations
[ https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5721: - Fix Version/s: 0.12.3 > Add Github actions on more validations > -- > > Key: HUDI-5721 > URL: https://issues.apache.org/jira/browse/HUDI-5721 > Project: Apache Hudi > Issue Type: Test >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.1, 0.12.3 > > > Add the following validation from source release validation to Github actions: > * Binary files should not be present > * DISCLAIMER file should not be present > * LICENSE and NOTICE should exist > * Licensing check > * RAT check -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5794) Fail any new commits if there is any inflight restore in timeline
[ https://issues.apache.org/jira/browse/HUDI-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-5794. Resolution: Fixed > Fail any new commits if there is any inflight restore in timeline > - > > Key: HUDI-5794 > URL: https://issues.apache.org/jira/browse/HUDI-5794 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.1, 0.12.3 > > > if restore failed mid-way, users should not be allowed to start new commits. > lets add a guard rail around that. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5794) Fail any new commits if there is any inflight restore in timeline
[ https://issues.apache.org/jira/browse/HUDI-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5794: - Fix Version/s: 0.12.3 > Fail any new commits if there is any inflight restore in timeline > - > > Key: HUDI-5794 > URL: https://issues.apache.org/jira/browse/HUDI-5794 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.1, 0.12.3 > > > if restore failed mid-way, users should not be allowed to start new commits. > lets add a guard rail around that. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5721) Add Github actions on more validations
[ https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5721: - Issue Type: Test (was: Improvement) > Add Github actions on more validations > -- > > Key: HUDI-5721 > URL: https://issues.apache.org/jira/browse/HUDI-5721 > Project: Apache Hudi > Issue Type: Test >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.1 > > > Add the following validation from source release validation to Github actions: > * Binary files should not be present > * DISCLAIMER file should not be present > * LICENSE and NOTICE should exist > * Licensing check > * RAT check -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5612) Integrate metadata table with SpillableMapBasedFileSystemView and RocksDbBasedFileSystemView
[ https://issues.apache.org/jira/browse/HUDI-5612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5612: - Component/s: metadata > Integrate metadata table with SpillableMapBasedFileSystemView and > RocksDbBasedFileSystemView > > > Key: HUDI-5612 > URL: https://issues.apache.org/jira/browse/HUDI-5612 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: Ethan Guo >Priority: Critical > Fix For: 0.13.1 > > > Currently, metadata-table-based file listing is integrated through > HoodieMetadataFileSystemView. SpillableMapBasedFileSystemView (storage type > of SPILLABLE_DISK) and RocksDbBasedFileSystemView (storage type of > EMBEDDED_KV_STORE) are independent of HoodieMetadataFileSystemView, and these > two file system view cannot leverage metadata-table-based file listing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5611) Revisit metadata-table-based file listing calls and use batch lookup instead
[ https://issues.apache.org/jira/browse/HUDI-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-5611: - Component/s: metadata > Revisit metadata-table-based file listing calls and use batch lookup instead > > > Key: HUDI-5611 > URL: https://issues.apache.org/jira/browse/HUDI-5611 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: Ethan Guo >Priority: Critical > Fix For: 0.13.1 > > > We discover a performance issue with savepoint when the metadata table is > enabled. It is due to unnecessary scanning of the metadata table when the > number of partitions is large. When the metadata table is enabled, in the > savepoint operation, for each partition, the metadata table is scanned, which > leads to a lot of S3 requests. The solution is to batch the list calls of > all partitions (HUDI-5485). > > We need to revisit metadata-table-based file listing calls in a similar > fashion and replace them with batch lookup if needed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131847074 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestAutoGenerationOfRecordKeys.scala: ## @@ -0,0 +1,282 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hadoop.fs.FileSystem +import org.apache.hudi.HoodieConversionUtils.toJavaOption +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.model.{HoodieRecord, HoodieTableType, WriteOperationType} +import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings +import org.apache.hudi.common.util +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.exception.ExceptionUtil.getRootCause +import org.apache.hudi.exception.HoodieException +import org.apache.hudi.functional.CommonOptionUtils._ +import org.apache.hudi.keygen.constant.KeyGeneratorOptions +import org.apache.hudi.keygen.{ComplexKeyGenerator, NonpartitionedKeyGenerator, SimpleKeyGenerator, TimestampBasedKeyGenerator} +import org.apache.hudi.keygen.constant.KeyGeneratorOptions.Config +import org.apache.hudi.testutils.HoodieSparkClientTestBase +import org.apache.hudi.util.JFunction +import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers, ScalaAssertionSupport} +import org.apache.spark.sql.hudi.HoodieSparkSessionExtension +import org.apache.spark.sql.{SaveMode, SparkSession, SparkSessionExtensions} +import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue} +import org.junit.jupiter.api.{AfterEach, BeforeEach, Test} +import org.junit.jupiter.params.ParameterizedTest +import org.junit.jupiter.params.provider.{CsvSource, EnumSource} + +import java.util.function.Consumer +import scala.collection.JavaConversions._ +import scala.collection.JavaConverters._ + +class TestAutoGenerationOfRecordKeys extends HoodieSparkClientTestBase with ScalaAssertionSupport { + var spark: SparkSession = null Review Comment: this will be set in BeforeEach method. we don't have any code paths were this might be null. I don't think we need to add Option here. 
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845834 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1096,31 +1104,47 @@ object HoodieSparkSqlWriter { Some(writerSchema)) avroRecords.mapPartitions(it => { + val sparkPartitionId = TaskContext.getPartitionId() + val dataFileSchema = new Schema.Parser().parse(dataFileSchemaStr) val consistentLogicalTimestampEnabled = parameters.getOrElse( DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key(), DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.defaultValue()).toBoolean - it.map { avroRecord => + // generate record keys is auto generation is enabled. + val recordsWithRecordKeyOverride = mayBeAutoGenerateRecordKeys(autoGenerateRecordKeys, it, instantTime) + + // handle dropping partition columns + recordsWithRecordKeyOverride.map { avroRecordRecordKeyOverRide => val processedRecord = if (shouldDropPartitionColumns) { - HoodieAvroUtils.rewriteRecord(avroRecord, dataFileSchema) + HoodieAvroUtils.rewriteRecord(avroRecordRecordKeyOverRide._1, dataFileSchema) +} else { + avroRecordRecordKeyOverRide._1 +} + +// Generate HoodieKey for records +val hoodieKey = if (autoGenerateRecordKeys) { + // fetch record key from the recordKeyOverride if auto generation is enabled. + new HoodieKey(avroRecordRecordKeyOverRide._2.get, keyGenerator.getKey(avroRecordRecordKeyOverRide._1).getPartitionPath) Review Comment: Since we have plans to fix this w/ https://github.com/apache/hudi/pull/7699 HUDI-5535, I don't want to add additional apis to the base interface/abstract class for now. lets revisit holistically. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
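The branch discussed in this review thread can be condensed into a small stand-alone sketch (simplified, hypothetical names — not the actual HoodieSparkSqlWriter code): when auto generation is enabled the record key comes from the pre-computed override carried alongside each record, otherwise the configured key generator derives it from the payload.

```scala
// Illustrative sketch only; KeyResolutionSketch and its parameters are
// simplified stand-ins for the logic quoted in the diff above.
object KeyResolutionSketch {
  // overrideKey: the auto-generated key paired with the record (if any);
  // keyFromGenerator: lazily evaluated fallback via the key generator.
  def resolveRecordKey(autoGenerateRecordKeys: Boolean,
                       overrideKey: Option[String],
                       keyFromGenerator: => String): String =
    if (autoGenerateRecordKeys) overrideKey.get else keyFromGenerator
}
```

Because the fallback is a by-name parameter, the key generator is only invoked when auto generation is off, mirroring the intent of the quoted code.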
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845834 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1096,31 +1104,47 @@ object HoodieSparkSqlWriter { Some(writerSchema)) avroRecords.mapPartitions(it => { + val sparkPartitionId = TaskContext.getPartitionId() + val dataFileSchema = new Schema.Parser().parse(dataFileSchemaStr) val consistentLogicalTimestampEnabled = parameters.getOrElse( DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key(), DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.defaultValue()).toBoolean - it.map { avroRecord => + // generate record keys is auto generation is enabled. + val recordsWithRecordKeyOverride = mayBeAutoGenerateRecordKeys(autoGenerateRecordKeys, it, instantTime) + + // handle dropping partition columns + recordsWithRecordKeyOverride.map { avroRecordRecordKeyOverRide => val processedRecord = if (shouldDropPartitionColumns) { - HoodieAvroUtils.rewriteRecord(avroRecord, dataFileSchema) + HoodieAvroUtils.rewriteRecord(avroRecordRecordKeyOverRide._1, dataFileSchema) +} else { + avroRecordRecordKeyOverRide._1 +} + +// Generate HoodieKey for records +val hoodieKey = if (autoGenerateRecordKeys) { + // fetch record key from the recordKeyOverride if auto generation is enabled. + new HoodieKey(avroRecordRecordKeyOverRide._2.get, keyGenerator.getKey(avroRecordRecordKeyOverRide._1).getPartitionPath) Review Comment: yes. https://github.com/apache/hudi/pull/7699 HUDI-5535 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845254 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.avro.generic.GenericRecord +import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS +import org.apache.hudi.common.config.HoodieConfig +import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType} +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.exception.HoodieException +import org.apache.spark.TaskContext + +object AutoRecordKeyGenerationUtils { + + // supported operation types when auto generation of record keys is enabled. 
+ val supportedOperations: Set[String] = +Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, WriteOperationType.DELETE, + WriteOperationType.INSERT_OVERWRITE, WriteOperationType.INSERT_OVERWRITE_TABLE, + WriteOperationType.DELETE_PARTITION).map(_.name()) + + def validateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, String], + operation: WriteOperationType, hoodieConfig: HoodieConfig): Unit = { +val autoGenerateRecordKeys: Boolean = parameters.getOrElse(HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.key(), + HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.defaultValue()).toBoolean + +if (autoGenerateRecordKeys) { + // check for supported operations. + if (!supportedOperations.contains(operation.name())) { +throw new HoodieException(operation.name() + " is not supported with Auto generation of record keys. " + + "Supported operations are : " + supportedOperations) + } + // de-dup is not supported with auto generation of record keys + if (parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(), +HoodieWriteConfig.COMBINE_BEFORE_INSERT.defaultValue()).toBoolean) { +throw new HoodieException("Enabling " + HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() + " is not supported with auto generation of record keys "); + } + // drop dupes is not supported + if (hoodieConfig.getBoolean(INSERT_DROP_DUPS)) { +throw new HoodieException("Enabling " + INSERT_DROP_DUPS.key() + " is not supported with auto generation of record keys "); + } + // virtual keys are not supported with auto generation of record keys. + if (!parameters.getOrElse(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString).toBoolean) { +throw new HoodieException("Disabling " + HoodieTableConfig.POPULATE_META_FIELDS.key() + " is not supported with auto generation of record keys"); + } +} + } + + /** + * Auto Generate record keys when auto generation config is enabled. 
+ * + * Generated keys will be unique not only w/in provided [[org.apache.spark.sql.DataFrame]], but + * globally unique w/in the target table + * Generated keys have minimal overhead (to compute, persist and read) + * + * + * Keys adhere to the following format: + * + * [instantTime]_[PartitionId]_[RowId] + * + * where + * instantTime refers to the commit time of the batch being ingested. + * PartitionId refers to spark's partition Id. + * RowId refers to the row index within the spark partition. + * + * @param autoGenerateKeys true if auto generation of record keys is enabled. false otherwise. + * @param genRecsItr Iterator of GenericRecords. + * @param instantTime commit time of the batch. + * @return Iterator of Pair of GenericRecord and Optionally generated record key. + */ + def mayBeAutoGenerateRecordKeys(autoGenerateKeys : Boolean, genRecsItr: Iterator[GenericRecord], instantTime: String): Iterator[(GenericRecord, Option[String])] = { +var rowId = 0 +val sparkPartitionId = TaskContext.getPartitionId() + +// we will override record keys if auto generation if keys is enabled. +genRecsItr.map(avroReco
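The key format documented in the javadoc above can be illustrated with a minimal stand-alone sketch (hypothetical object name, not Hudi code): each key is `[instantTime]_[PartitionId]_[RowId]`, so keys are unique within a Spark partition via the row index, across partitions via the partition id, and across the table via the commit's instant time.

```scala
// Hypothetical sketch of the documented key scheme; RecordKeySketch is
// illustrative and not part of the Hudi codebase.
object RecordKeySketch {
  // Produce one key per row index in a given Spark partition for a commit.
  def autoGenerateKeys(instantTime: String, partitionId: Int, numRows: Int): Seq[String] =
    (0 until numRows).map(rowId => s"${instantTime}_${partitionId}_${rowId}")
}
```

For example, `autoGenerateKeys("20230313101530", 7, 2)` yields `Seq("20230313101530_7_0", "20230313101530_7_1")`, matching the format described in the javadoc.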
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131844741 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.avro.generic.GenericRecord +import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS +import org.apache.hudi.common.config.HoodieConfig +import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType} +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.exception.HoodieException +import org.apache.spark.TaskContext + +object AutoRecordKeyGenerationUtils { + + // supported operation types when auto generation of record keys is enabled. 
+ val supportedOperations: Set[String] = +Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, WriteOperationType.DELETE, + WriteOperationType.INSERT_OVERWRITE, WriteOperationType.INSERT_OVERWRITE_TABLE, + WriteOperationType.DELETE_PARTITION).map(_.name()) + + def validateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, String], + operation: WriteOperationType, hoodieConfig: HoodieConfig): Unit = { +val autoGenerateRecordKeys: Boolean = parameters.getOrElse(HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.key(), + HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.defaultValue()).toBoolean + +if (autoGenerateRecordKeys) { + // check for supported operations. + if (!supportedOperations.contains(operation.name())) { +throw new HoodieException(operation.name() + " is not supported with Auto generation of record keys. " + + "Supported operations are : " + supportedOperations) + } + // de-dup is not supported with auto generation of record keys + if (parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(), +HoodieWriteConfig.COMBINE_BEFORE_INSERT.defaultValue()).toBoolean) { +throw new HoodieException("Enabling " + HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() + " is not supported with auto generation of record keys "); + } + // drop dupes is not supported + if (hoodieConfig.getBoolean(INSERT_DROP_DUPS)) { +throw new HoodieException("Enabling " + INSERT_DROP_DUPS.key() + " is not supported with auto generation of record keys "); + } + // virtual keys are not supported with auto generation of record keys. + if (!parameters.getOrElse(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString).toBoolean) { +throw new HoodieException("Disabling " + HoodieTableConfig.POPULATE_META_FIELDS.key() + " is not supported with auto generation of record keys"); + } +} + } + + /** + * Auto Generate record keys when auto generation config is enabled. 
+ * + * Generated keys will be unique not only w/in provided [[org.apache.spark.sql.DataFrame]], but + * globally unique w/in the target table + * Generated keys have minimal overhead (to compute, persist and read) + * + * + * Keys adhere to the following format: + * + * [instantTime]_[PartitionId]_[RowId] + * + * where + * instantTime refers to the commit time of the batch being ingested. + * PartitionId refers to spark's partition Id. + * RowId refers to the row index within the spark partition. + * + * @param autoGenerateKeys true if auto generation of record keys is enabled. false otherwise. + * @param genRecsItr Iterator of GenericRecords. + * @param instantTime commit time of the batch. + * @return Iterator of Pair of GenericRecord and Optionally generated record key. + */ + def mayBeAutoGenerateRecordKeys(autoGenerateKeys : Boolean, genRecsItr: Iterator[GenericRecord], instantTime: String): Iterator[(GenericRecord, Option[String])] = { +var rowId = 0 +val sparkPartitionId = TaskContext.getPartitionId() + +// we will override record keys if auto generation if keys is enabled. +genRecsItr.map(avroReco
[jira] [Updated] (HUDI-4245) Support nested fields in Column Stats Index
[ https://issues.apache.org/jira/browse/HUDI-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4245: - Component/s: metadata > Support nested fields in Column Stats Index > --- > > Key: HUDI-4245 > URL: https://issues.apache.org/jira/browse/HUDI-4245 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.13.1 > > > Currently only root-level fields are supported in the Column Stats Index, > while there's no reason for us not to be able to support nested fields given > that columnar file formats store nested fields as _nested columns,_ ie as > columns with a name of the field and corresponding struct it attributes to. > > For example following schema: > {code:java} > c1: StringType > c2: StructType(Seq(StructField("foo", StringType))){code} > Would be stored in Parquet as "c1: string", "c2.foo: string", entailing that > Parquet actually already collects statistics for all the nested fields and we > just need to make sure we're propagating them into Column Stats Index > > Original GH issue: > [https://github.com/apache/hudi/issues/5804#issuecomment-1152983029] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1131842281 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.avro.generic.GenericRecord +import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS +import org.apache.hudi.common.config.HoodieConfig +import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType} +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.exception.HoodieException +import org.apache.spark.TaskContext + +object AutoRecordKeyGenerationUtils { + + // supported operation types when auto generation of record keys is enabled. + val supportedOperations: Set[String] = +Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, WriteOperationType.DELETE, + WriteOperationType.INSERT_OVERWRITE, WriteOperationType.INSERT_OVERWRITE_TABLE, + WriteOperationType.DELETE_PARTITION).map(_.name()) Review Comment: as called out in the docs, UPDATE and DELETE via spark-sql should be supported. 
That will be phase 2.
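The guard debated in this thread amounts to a simple membership check on the operation name; a hedged sketch follows (the set mirrors the `supportedOperations` val quoted above, but the object and exception type here are illustrative only — for instance, UPSERT would be rejected until phase 2 lands).

```scala
// Illustrative sketch of the supported-operation guard; AutoKeyGuardSketch
// is not the actual AutoRecordKeyGenerationUtils code.
object AutoKeyGuardSketch {
  val supportedOperations: Set[String] = Set(
    "INSERT", "BULK_INSERT", "DELETE",
    "INSERT_OVERWRITE", "INSERT_OVERWRITE_TABLE", "DELETE_PARTITION")

  // Throws when the write operation cannot be combined with auto-generated keys.
  def validateOperation(operation: String): Unit =
    if (!supportedOperations.contains(operation))
      throw new IllegalArgumentException(
        s"$operation is not supported with auto generation of record keys. " +
          s"Supported operations: ${supportedOperations.mkString(", ")}")
}
```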