Re: [I] ClassNotFoundException: MergeOnReadInputSplit [hudi]
ad1happy2go commented on issue #9474: URL: https://github.com/apache/hudi/issues/9474#issuecomment-1778613662

@jiangzzwy I tried a similar command and it worked for me; it looks like a problem in your setup. Did you add the jar under $FLINK_HOME/lib? Let us know if you still face this issue. Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
hudi-bot commented on PR #9912: URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778609048

## CI report:

* aadc5fbc31b83cfff275fee66618071b0bc9e76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20471)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
hudi-bot commented on PR #9910: URL: https://github.com/apache/hudi/pull/9910#issuecomment-1778608974

## CI report:

* f158692bc1611582566b3bbd76e49d07a290e802 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20447)
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
ksmou commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1371222617

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:

@@ -63,21 +60,9 @@
   public Comparator getComparator() {
     return comparator;
   }

-  @Override
-  public List orderAndFilter(HoodieWriteConfig writeConfig, List operations, List pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment: If the original size `allPartitionPaths.size()` is smaller than `writeConfig.getTargetPartitionsPerDayBasedCompaction()`, an unclamped `subList(0, writeConfig.getTargetPartitionsPerDayBasedCompaction())` would throw an IndexOutOfBoundsException. I think we can use `limit` to replace `subList()`.
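The reviewer's suggestion can be illustrated with a minimal standalone sketch (hypothetical class and data, not Hudi's actual code): `Stream.limit` caps the result at however many elements exist, whereas `subList` with an upper bound past the end of the list throws.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LimitVsSubList {
    // Hypothetical stand-in for filterPartitionPaths: keep at most `target`
    // partitions in reverse-sorted order, tolerating target > partitions.size().
    static List<String> pick(List<String> partitions, int target) {
        return partitions.stream()
                .sorted(Comparator.reverseOrder())
                .limit(target) // never throws, unlike subList(0, target)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2023/10/01", "2023/10/02");
        // subList(0, 5) on this 2-element list would throw IndexOutOfBoundsException;
        // limit(5) simply returns both elements.
        System.out.println(pick(partitions, 5));
    }
}
```

This removes the need for the `Math.min` clamp entirely, since the bound check is built into `limit`.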
Re: [I] [SUPPORT] Parquet files got cleaned up even when cleaning operation failed hence leading to subsequent failed clustering and cleaning [hudi]
ad1happy2go commented on issue #9257: URL: https://github.com/apache/hudi/issues/9257#issuecomment-1778596706

@adityaverma1997 Sorry for all the delays here. I did try to reproduce this a couple of times but never got any error, and I also tried to mock up some failures during cleaning. It really depends on exactly when the cleaning fails. Are you able to reproduce this consistently?
Re: [I] [SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions [hudi]
codope closed issue #9806: [SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions URL: https://github.com/apache/hudi/issues/9806
Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p
ad1happy2go commented on issue #8614: URL: https://github.com/apache/hudi/issues/8614#issuecomment-1778570591

@danny0405 I think the issue is `org.apache.hudi:hudi-utilities-bundle_2.12:0.13.1`. The utilities bundle jar can't carry a dependency specific to each Spark version, so don't use the Maven one; either build your own jar and use that, or use the slim-bundle package. We should not use the utilities bundle and the Spark bundle together, since the utilities bundle already includes the Spark bundle dependency. So ideally use the utilities slim bundle. @pushpavanthar I did ask you to try the same on this Slack thread - https://apache-hudi.slack.com/archives/C4D716NPQ/p1697802409713149. Were you able to try this out?
[hudi] branch master updated (051eb0e930e -> 98d956fd845)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 051eb0e930e [MINOR] Add tests on combine parallelism (#9731)
add  98d956fd845 [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 (#9914)

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]
yihua merged PR #9914: URL: https://github.com/apache/hudi/pull/9914
Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
hudi-bot commented on PR #9910: URL: https://github.com/apache/hudi/pull/9910#issuecomment-1778561669

## CI report:

* f158692bc1611582566b3bbd76e49d07a290e802 UNKNOWN
Re: [I] [SUPPORT] Compaction error [hudi]
codope closed issue #9885: [SUPPORT] Compaction error URL: https://github.com/apache/hudi/issues/9885
Re: [I] [SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions [hudi]
ad1happy2go commented on issue #9806: URL: https://github.com/apache/hudi/issues/9806#issuecomment-1778552572

@buiducsinh34 @noahtaite Closing this out as the PR is merged. Thanks, everybody. Feel free to reopen if you still see the issue.
Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p
pushpavanthar commented on issue #8614: URL: https://github.com/apache/hudi/issues/8614#issuecomment-1778547455

We tried running this on emr-6.7.0 and a few other higher release labels.
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371173652

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:

@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
   }

   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment: Yes, I'd like to make sure that my changes do not break MERGE INTO on MOR tables.
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371170669

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:

@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
   }

   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment: Got it. Is it related to this change?
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371157540

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:

@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
   }

   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment: This is to ensure that for a MOR table, log files are written. Otherwise, the MOR table generated by the test may not contain log files, which is no different from COW.
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
ad1happy2go commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1778505679

@fenil25 The bulk-insert operation doesn't do small-file handling, which is why you see file sizes equal to the split size. So the total number of partitions is calculated as `number_of_files * number_of_blocks_in_file`.
- One way to handle this case is to run clustering with the proper configuration to achieve correctly sized files.
- The other way is to configure the Spark setting `spark.sql.files.maxPartitionBytes` while doing the bulk insert; it defaults to 128 MB in Spark.
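The effect of raising `spark.sql.files.maxPartitionBytes` can be sketched with back-of-the-envelope split math (an illustration only, not Hudi's or Spark's exact sizing code): each input split is at most `maxPartitionBytes`, so a larger setting yields fewer, larger splits and hence fewer, larger output files.

```java
public class SplitMath {
    // Approximate number of input splits Spark derives for a file of the
    // given size, assuming each split is at most maxPartitionBytes
    // (ceiling division).
    static long numSplits(long fileSizeBytes, long maxPartitionBytes) {
        return (fileSizeBytes + maxPartitionBytes - 1) / maxPartitionBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long tenGb = 10L * 1024 * mb;
        System.out.println(numSplits(tenGb, 128 * mb)); // 80 splits at Spark's 128 MB default
        System.out.println(numSplits(tenGb, 512 * mb)); // 20 splits with a 512 MB setting
    }
}
```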
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778503193

## CI report:

* 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
* bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
danny0405 commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1371143452

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:

@@ -65,8 +65,11 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
   // For more details please check HUDI-4161
   // NOTE: This override has to mirror semantic of whenever this Relation is converted into [[HadoopFsRelation]],
   //       which is currently done for all cases, except when Schema Evolution is enabled
-  override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean =
-    internalSchemaOpt.isEmpty
+  override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = {
+    if (hasSchemaOnRead) {
+      super.needExtractPartitionValuesFromPartitionPath()
+    } else true

Review Comment: What exactly is the behavior change at line 205? Can you elaborate a little more?
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
danny0405 commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1371142864

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:

@@ -220,7 +220,9 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
   * partition path, meaning that string value of "2022/01/01" will be appended, and not its original
   * representation
   */
-  protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = {
+  protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = needExtractPartitionValuesFromPartitionPath()
+
+  protected def needExtractPartitionValuesFromPartitionPath(): Boolean = {
     // Controls whether partition columns (which are the source for the partition path values) should

Review Comment: Why add a new method name?
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
danny0405 commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1371141863

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:

@@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
     val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf,
       ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
     if (enableFileIndex && globPaths.isEmpty) {
-      // NOTE: There are currently 2 ways partition values could be fetched:
-      //   - Source columns (producing the values used for physical partitioning) will be read
-      //     from the data file
-      //   - Values parsed from the actual partition path would be appended to the final dataset
-      //
-      //   In the former case, we don't need to provide the partition-schema to the relation,
-      //   therefore we simply stub it w/ empty schema and use full table-schema as the one being
-      //   read from the data file.

Review Comment: Can you add this detailed info as a comment there?
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
danny0405 commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1371139632

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:

@@ -63,21 +60,9 @@
   public Comparator getComparator() {
     return comparator;
   }

-  @Override
-  public List orderAndFilter(HoodieWriteConfig writeConfig, List operations, List pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment: Why do we subList its original size? I'm confused.
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
danny0405 commented on code in PR #9912: URL: https://github.com/apache/hudi/pull/9912#discussion_r1371137154

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:

@@ -226,9 +226,9 @@ public void monitorDirAndForwardSplits(SourceContext cont
     this.issuedOffset = result.getOffset();
     LOG.info("\n"
         + "\n"
-        + "-- consumed to instant: {}\n"
+        + "-- consumed {} to instant: {}\n"
         + "",
-        this.issuedInstant);
+        conf.getString(FlinkOptions.TABLE_NAME), this.issuedInstant);

Review Comment: I would like it to be:

-- table: xxx
-- consumed to instant: xxx
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
danny0405 commented on code in PR #9912: URL: https://github.com/apache/hudi/pull/9912#discussion_r1371135529

## hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:

@@ -57,6 +59,15 @@ public String getEndInstant() {
   public abstract boolean isInRange(String instant);

+  @Override
+  public String toString() {
+    return "InstantRange{"

Review Comment: The start or end range may be null.
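Since `startInstant` or `endInstant` may be null for open-ended ranges, the `toString` needs to be null-safe. A minimal sketch (a hypothetical simplified class, not the actual Hudi implementation) using `Objects.toString` with a placeholder instead of risking string concatenation printing raw nulls:

```java
import java.util.Objects;

// Hypothetical simplified version of InstantRange, illustrating a
// null-tolerant toString for open-ended ranges.
public class InstantRangeSketch {
    private final String startInstant; // may be null (open start)
    private final String endInstant;   // may be null (open end)

    public InstantRangeSketch(String startInstant, String endInstant) {
        this.startInstant = startInstant;
        this.endInstant = endInstant;
    }

    @Override
    public String toString() {
        // Objects.toString substitutes a placeholder when the value is null
        return "InstantRange{start=" + Objects.toString(startInstant, "-")
                + ", end=" + Objects.toString(endInstant, "-") + '}';
    }

    public static void main(String[] args) {
        System.out.println(new InstantRangeSketch(null, "20231024"));
    }
}
```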
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
danny0405 commented on code in PR #9892: URL: https://github.com/apache/hudi/pull/9892#discussion_r1371133664

## hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java:

@@ -86,30 +86,26 @@
     GenericRecord incomingRecord = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
     eventTime = updateEventTime(incomingRecord, properties);
-    return isDeleteRecord(incomingRecord, properties) ? Option.empty() : Option.of(incomingRecord);
+    return isDeleted(schema, properties) ? Option.empty() : Option.of(incomingRecord);
   }

-  /**
-   * @param genericRecord instance of {@link GenericRecord} of interest.
-   * @param properties payload related properties
-   * @returns {@code true} if record represents a delete record. {@code false} otherwise.
-   */
-  protected boolean isDeleteRecord(GenericRecord genericRecord, Properties properties) {
-    final String deleteKey = properties.getProperty(DELETE_KEY);
+  @Override
+  protected boolean isDeleteRecord(GenericRecord record, Properties props) {
+    final String deleteKey = props.getProperty(DELETE_KEY);
     if (StringUtils.isNullOrEmpty(deleteKey)) {
-      return isDeleteRecord(genericRecord);
+      return super.isDeleteRecord(record, props);

Review Comment: Is this line the actual fix? I didn't see the props get used by the super method, so do we still need to pass around all the props here?
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
danny0405 commented on code in PR #9892: URL: https://github.com/apache/hudi/pull/9892#discussion_r1371132703

## hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java:

@@ -45,12 +45,12 @@ public class DefaultHoodieRecordPayload extends OverwriteWithLatestAvroPayload {
   public static final String DELETE_MARKER = "hoodie.payload.delete.marker";
   private Option eventTime = Option.empty();

-  public DefaultHoodieRecordPayload(GenericRecord record, Comparable orderingVal) {
-    super(record, orderingVal);
+  public DefaultHoodieRecordPayload(GenericRecord record, Comparable orderingVal, Properties props) {
+    super(record, orderingVal, props);
   }

Review Comment: The source of the props seems chaotic; I have already seen several ways it is produced:
1. `config.getPayloadConfig().getProps()` in `HoodieMergeHandle`;
2. `payloadProps.setProperty(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, preCombineField);` in `HoodieFileSliceReader`;
3. `config.getProps()` in `HoodieIndexUtils`.
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
danny0405 commented on code in PR #9892: URL: https://github.com/apache/hudi/pull/9892#discussion_r1371109726

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/PayloadCreation.java:

@@ -43,14 +44,17 @@ public class PayloadCreation implements Serializable {
   private static final long serialVersionUID = 1L;
   private final boolean shouldCombine;
+  private final boolean shouldUsePropsForPayload;
   private final Constructor constructor;
   private final String preCombineField;
   private PayloadCreation(
       boolean shouldCombine,
+      boolean shouldUsePropsForPayload,
       Constructor constructor,
       @Nullable String preCombineField) {
     this.shouldCombine = shouldCombine;
+    this.shouldUsePropsForPayload = shouldUsePropsForPayload;

Review Comment: Should `shouldUsePropsForPayload` always be true?

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/PayloadCreation.java:

@@ -60,34 +64,63 @@ public static PayloadCreation instance(Configuration conf) throws Exception {
     boolean needCombine = conf.getBoolean(FlinkOptions.PRE_COMBINE) || WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)) == WriteOperationType.UPSERT;
     boolean shouldCombine = needCombine && preCombineField != null;
+    boolean shouldUsePropsForPayload = true;

-    final Class[] argTypes;
-    final Constructor constructor;
+    Class[] argTypes;
+    Constructor constructor;
     if (shouldCombine) {
-      argTypes = new Class[] {GenericRecord.class, Comparable.class};
+      argTypes = new Class[] {GenericRecord.class, Comparable.class, Properties.class};
     } else {
-      argTypes = new Class[] {Option.class};
+      argTypes = new Class[] {Option.class, Properties.class};
     }
     final String clazz = conf.getString(FlinkOptions.PAYLOAD_CLASS_NAME);
-    constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
-    return new PayloadCreation(shouldCombine, constructor, preCombineField);
+    try {
+      constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
+    } catch (NoSuchMethodException e) {
+      shouldUsePropsForPayload = false;
+      if (shouldCombine) {
+        argTypes = new Class[] {GenericRecord.class, Comparable.class};
+      } else {
+        argTypes = new Class[] {Option.class};
+      }
+      constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
+    }
+    return new PayloadCreation(shouldCombine, shouldUsePropsForPayload, constructor, preCombineField);
+  }
+
+  public static Properties extractPropsFromConfiguration(Configuration config) {
+    Properties props = new Properties();

Review Comment: If all we want is payload properties, you can use `StreamerUtil.getPayloadConfig`.
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
hudi-bot commented on PR #9912: URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778466955

## CI report:

* 7f6535290896455bb3312e7203f2eafa69109f05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20457)
* aadc5fbc31b83cfff275fee66618071b0bc9e76d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20471)
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
hudi-bot commented on PR #9912: URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778462077

## CI report:

* 7f6535290896455bb3312e7203f2eafa69109f05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20457)
* aadc5fbc31b83cfff275fee66618071b0bc9e76d UNKNOWN
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778461940

## CI report:

* 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
* d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
* bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371105049

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:

@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
   }

   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment: Why this change?
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
zhuanshenbsj1 commented on code in PR #9912: URL: https://github.com/apache/hudi/pull/9912#discussion_r1371102567 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java: ## @@ -34,10 +34,12 @@ public abstract class InstantRange implements Serializable { protected final String startInstant; protected final String endInstant; + protected final String rangeType; - public InstantRange(String startInstant, String endInstant) { + public InstantRange(String startInstant, String endInstant, String rangeType) { this.startInstant = startInstant; this.endInstant = endInstant; +this.rangeType = rangeType; Review Comment: Adjust as u say. ## hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java: ## @@ -57,6 +59,15 @@ public String getEndInstant() { public abstract boolean isInRange(String instant); + @Override + public String toString() { +return "InstantRange{" Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
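The two hunks above add a `rangeType` field to `InstantRange` and a `toString()` so streaming-read logs can print the full range. A compilable sketch of that shape (the real class is abstract with per-range-type subclasses and the quoted `toString` hunk is truncated, so the closed-range `isInRange` and the exact string format here are assumptions):

```java
import java.io.Serializable;

// Sketch of the InstantRange change discussed above: carry a rangeType next
// to the bounds and surface it in toString() for streaming-read logs.
// The real class is abstract; isInRange here assumes a closed range.
public class InstantRange implements Serializable {
  protected final String startInstant;
  protected final String endInstant;
  protected final String rangeType;

  public InstantRange(String startInstant, String endInstant, String rangeType) {
    this.startInstant = startInstant;
    this.endInstant = endInstant;
    this.rangeType = rangeType;
  }

  public boolean isInRange(String instant) {
    // Hudi instant times are sortable strings, so string comparison suffices.
    return startInstant.compareTo(instant) <= 0 && endInstant.compareTo(instant) >= 0;
  }

  @Override
  public String toString() {
    return "InstantRange{startInstant='" + startInstant
        + "', endInstant='" + endInstant
        + "', rangeType='" + rangeType + "'}";
  }
}
```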
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778455957 ## CI report: * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
wecharyu commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1371097371 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: ## @@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext, val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf, ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean if (enableFileIndex && globPaths.isEmpty) { - // NOTE: There are currently 2 ways partition values could be fetched: - // - Source columns (producing the values used for physical partitioning) will be read - // from the data file - // - Values parsed from the actual partition path would be appended to the final dataset - // - //In the former case, we don't need to provide the partition-schema to the relation, - //therefore we simply stub it w/ empty schema and use full table-schema as the one being - //read from the data file. Review Comment: Got your point. The change here is because baseRelation will be converted to HadoopFsRelation only when `baseRelation.hasSchemaOnRead` is **false**: https://github.com/apache/hudi/blob/65dd645b487a61fbca7e4e4b849d1f2f1ec143f9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L328-L332 In this case `shouldExtractPartitionValuesFromPartitionPath` is true; this is just a code simplification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6900) TestInsertTable "Test Bulk Insert Into Consistent Hashing Bucket Index Table" is failing continuously
[ https://issues.apache.org/jira/browse/HUDI-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6900. Fix Version/s: 1.0.0 Resolution: Fixed Fixed via master branch: 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9 > TestInsertTable "Test Bulk Insert Into Consistent Hashing Bucket Index Table" > is failing continuously > - > > Key: HUDI-6900 > URL: https://issues.apache.org/jira/browse/HUDI-6900 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > The test is failing on travis CI but can not reproduce in local, need some > time to debug the reasons. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [MINOR] Add tests on combine parallelism (#9731)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 051eb0e930e [MINOR] Add tests on combine parallelism (#9731) 051eb0e930e is described below commit 051eb0e930e983dd4118abec01e10d9b01f91ca0 Author: Y Ethan Guo AuthorDate: Tue Oct 24 20:19:08 2023 -0700 [MINOR] Add tests on combine parallelism (#9731) --- .../hudi/table/action/commit/BaseWriteHelper.java | 11 +-- .../table/action/commit/TestWriterHelperBase.java | 90 ++ .../table/action/commit/TestSparkWriteHelper.java | 76 ++ .../common/testutils/HoodieCommonTestHarness.java | 11 ++- 4 files changed, 180 insertions(+), 8 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java index 8d8978927f6..b5edc7878f9 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java @@ -27,7 +27,6 @@ import org.apache.hudi.common.util.HoodieRecordUtils; import org.apache.hudi.exception.HoodieUpsertException; import org.apache.hudi.index.HoodieIndex; import org.apache.hudi.table.HoodieTable; - import org.apache.hudi.table.action.HoodieWriteMetadata; import java.time.Duration; @@ -48,12 +47,9 @@ public abstract class BaseWriteHelper extends ParallelismHelper executor, WriteOperationType operationType) { try { - int targetParallelism = - deduceShuffleParallelism(inputRecords, configuredShuffleParallelism); - // De-dupe/merge if needed I dedupedRecords = - combineOnCondition(shouldCombine, inputRecords, targetParallelism, table); + combineOnCondition(shouldCombine, inputRecords, configuredShuffleParallelism, 
table); Instant lookupBegin = Instant.now(); I taggedRecords = dedupedRecords; @@ -79,8 +75,9 @@ public abstract class BaseWriteHelper extends ParallelismHelper table); public I combineOnCondition( - boolean condition, I records, int parallelism, HoodieTable table) { -return condition ? deduplicateRecords(records, table, parallelism) : records; + boolean condition, I records, int configuredParallelism, HoodieTable table) { +int targetParallelism = deduceShuffleParallelism(records, configuredParallelism); +return condition ? deduplicateRecords(records, table, targetParallelism) : records; } /** diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java new file mode 100644 index 000..2d43b414608 --- /dev/null +++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.table.action.commit; + +import org.apache.hudi.common.data.HoodieData; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.testutils.HoodieCommonTestHarness; +import org.apache.hudi.table.HoodieTable; + +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.CsvSource; + +import java.io.IOException; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +/** + * Tests for write helpers + */ +public abstract class TestWriterHelperBase extends HoodieCommonTestHarness { + private static int runNo = 0; + protected final BaseWriteHelper writeHelper; + protected HoodieEngineContext context; + protected HoodieTable table; + protected I inputRecord
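The core of the refactor in the (truncated) patch above is that `deduceShuffleParallelism` moves out of the caller and into `combineOnCondition`, next to the dedupe it controls. A minimal sketch with Hudi's `HoodieData`/`HoodieTable` types stubbed out as plain lists (the deduction rule shown is an assumption; the real helper derives parallelism from the engine-specific input):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the BaseWriteHelper refactor: parallelism deduction now happens
// inside combineOnCondition, right before the dedupe it parameterizes.
public class WriteHelperSketch {
  // Assumed rule: fall back to the input size when nothing is configured.
  static int deduceShuffleParallelism(List<String> records, int configured) {
    return configured > 0 ? configured : Math.max(1, records.size());
  }

  static List<String> deduplicateRecords(List<String> records, int parallelism) {
    // parallelism would drive the shuffle in the real engine; unused here.
    return records.stream().distinct().collect(Collectors.toList());
  }

  // After the change: deduce the target parallelism here, not in the caller.
  static List<String> combineOnCondition(boolean condition, List<String> records,
                                         int configuredParallelism) {
    int targetParallelism = deduceShuffleParallelism(records, configuredParallelism);
    return condition ? deduplicateRecords(records, targetParallelism) : records;
  }
}
```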
Re: [PR] [MINOR] Add tests on combine parallelism [hudi]
nsivabalan merged PR #9731: URL: https://github.com/apache/hudi/pull/9731 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
nsivabalan commented on code in PR #9892: URL: https://github.com/apache/hudi/pull/9892#discussion_r1371085714 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroPayload.java: ## @@ -39,11 +42,19 @@ public class HoodieAvroPayload implements HoodieRecordPayload private final Comparable orderingVal; public HoodieAvroPayload(GenericRecord record, Comparable orderingVal) { +this(record, orderingVal, EMPTY_PROPS); Review Comment: shouldn't we mark these as deprecated ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]
voonhous commented on code in PR #9752: URL: https://github.com/apache/hudi/pull/9752#discussion_r1371085479 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java: ## @@ -115,6 +114,35 @@ public byte[] getContentBytes() throws IOException { return serializeRecords(records.get()); } + private Schema getReaderSchema(Option readerSchemaOpt) { +Schema writerSchema = getWriterSchema(super.getLogBlockHeader()); +// If no reader-schema has been provided assume writer-schema as one +if (!readerSchemaOpt.isPresent()) { + return writerSchema; +} + +// Handle table renames when there are still log files +Schema readerSchema = readerSchemaOpt.get(); +if (isHandleDifferingNamespaceRequired(readerSchema, writerSchema)) { + return writerSchema; +} else { + return readerSchema; +} + } + + /** + * Spark3.1 uses avro:1.8.2, which matches fields by their fully qualified name. If namespaces are differs, reads will fail for fields that have the same name and type, but differing name(spaces). + * Such cases can arise when an ALTER-TABLE-RENAME ddl is performed. + * + * @param readerSchema the reader schema + * @param writerSchema the writer schema + * @return boolean if handling of differing namespaces between reader and writer schema are required + */ + private static boolean isHandleDifferingNamespaceRequired(Schema readerSchema, Schema writerSchema) { +return readerSchema.getClass().getPackage().getImplementationVersion().compareTo("1.8.2") <= 0 +&& !readerSchema.getName().equals(writerSchema.getName()); + } Review Comment: Agreed, the fix here was written with backwards compatibility in mind. For tables that already have log files that were written in a certain format + namespace and have yet to be compacted, it is not realistic to modify those name spaces block by block. Given that this is an avro internal issue for lower avro version, and will only happen when performing merge. 
I thought it would be appropriate to handle it here while reading log files. There are a few ways to fix this, as you mentioned: 1. `ALTER-TABLE-RENAME-DDL` renames the Hive table name but does not perform any internal renames, i.e. `hoodie.properties` is not re-written, so the schema stays consistent. 2. Block `ALTER-TABLE-RENAME-DDL` if there are uncompacted log files; this makes `ALTER-TABLE-RENAME-DDL` IO intensive, since it needs to scan through all latest file slices to check whether any file groups have uncompacted log files, i.e. log file list > 0 (akin to running a compaction-plan execution: if the set of compaction operations is not empty, we reject the rename). 3. Fix it while reading, as seen in the fix here (the code is ugly and we are coupling business logic with a dependency version). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
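The guard quoted in the review above can be exercised in isolation. In the PR the Avro version comes from `readerSchema.getClass().getPackage().getImplementationVersion()` and the names from `Schema#getName()`; this sketch passes both in explicitly so it runs without Avro on the classpath (note the lexicographic `compareTo` against `"1.8.2"`, exactly as in the quoted code):

```java
// Self-contained sketch of the namespace guard quoted above. Avro <= 1.8.2
// resolves record fields by fully qualified name, so after an
// ALTER-TABLE-RENAME the reader and writer schemas stop matching; the guard
// then falls back to the writer schema. Version and names are parameters
// here only so the logic runs without Avro on the classpath.
public class NamespaceGuardSketch {
  static boolean isHandleDifferingNamespaceRequired(String avroVersion,
                                                    String readerName,
                                                    String writerName) {
    // Lexicographic version comparison, as in the PR.
    return avroVersion.compareTo("1.8.2") <= 0
        && !readerName.equals(writerName);
  }

  static String chooseSchemaName(String avroVersion,
                                 String readerName,
                                 String writerName) {
    // Fall back to the writer schema when the old resolver would fail.
    return isHandleDifferingNamespaceRequired(avroVersion, readerName, writerName)
        ? writerName : readerName;
  }
}
```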
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
ksmou commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1371082887 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java: ## @@ -63,21 +60,9 @@ public Comparator getComparator() { return comparator; } - @Override - public List orderAndFilter(HoodieWriteConfig writeConfig, - List operations, List pendingCompactionPlans) { -// Iterate through the operations and accept operations as long as we are within the configured target partitions -// limit -return operations.stream() - .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream() - .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction()) -.flatMap(e -> e.getValue().stream()).collect(Collectors.toList()); - } - @Override public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) { -return allPartitionPaths.stream().map(partition -> partition.replace("/", "-")) -.sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/")) +return allPartitionPaths.stream().sorted(comparator) .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(), Review Comment: The test failures are not related to this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6979) support EventTimeBasedCompactionStrategy
Kong Wei created HUDI-6979: -- Summary: support EventTimeBasedCompactionStrategy Key: HUDI-6979 URL: https://issues.apache.org/jira/browse/HUDI-6979 Project: Apache Hudi Issue Type: New Feature Components: compaction Reporter: Kong Wei Assignee: Kong Wei The current compaction strategies are based on the log file size, the number of log files, etc. The event time of the RO table generated by these strategies is uncontrollable. Hudi also has a DayBased strategy, but it relies on a day-based partition path and its time granularity is coarse. The *EventTimeBasedCompactionStrategy* can generate event-time-friendly RO tables, whether the table is day-partitioned or not. For example, the strategy can select for compaction all log files whose event time is before 3 am, so that the generated RO table contains the data before 3 am. If we just want to query data before 3 am, we can query the RO table instead, which is much faster. With this strategy, I think we can expand the application scenarios of RO tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
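HUDI-6979 proposes the strategy without specifying an API, so the selection step it describes can only be sketched hypothetically: keep the file groups whose log data is complete up to an event-time cutoff (e.g. 3 am), so compacting them yields an RO view that is complete up to that point. All names below are illustrative, not from the ticket:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed EventTimeBasedCompactionStrategy
// selection step: keep only file groups whose maximum log event time is at
// or before the cutoff, so the compacted RO view is complete up to it.
// Class, method, and parameter names are invented for illustration.
public class EventTimeStrategySketch {
  // maxEventTimeByGroup: fileGroupId -> max event time (epoch millis) in its logs.
  static List<String> selectFileGroups(Map<String, Long> maxEventTimeByGroup,
                                       long eventTimeCutoff) {
    return maxEventTimeByGroup.entrySet().stream()
        .filter(e -> e.getValue() <= eventTimeCutoff)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```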
Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]
danny0405 commented on code in PR #9904: URL: https://github.com/apache/hudi/pull/9904#discussion_r137106 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java: ## @@ -269,6 +269,9 @@ public Result inputSplits( Result hollowSplits = getHollowInputSplits(metaClient, metaClient.getHadoopConf(), issuedInstant, issuedOffset, commitTimeline, cdcEnabled); List instants = filterInstantsWithRange(commitTimeline, issuedInstant); +int instantLimit = this.conf.getInteger(FlinkOptions.READ_COMMITS_LIMIT,Integer.MAX_VALUE); +instants = instants.subList(0, Math.min(instantLimit, instants.size())); + Review Comment: Can we add some tests for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
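The hunk quoted above caps how many commit instants one streaming read consumes. Isolated from `IncrementalInputSplits`, the behavior is just a bounded `subList` (names mirror the hunk; the surrounding split-generation plumbing is omitted):

```java
import java.util.List;

// Isolates the capping behavior of the hunk above: limit the number of
// commit instants consumed per streaming read to the configured limit
// (Integer.MAX_VALUE, i.e. effectively unlimited, when unset).
public class ReadCommitsLimitSketch {
  static List<String> capInstants(List<String> instants, int instantLimit) {
    // subList returns a view of the first min(limit, size) instants.
    return instants.subList(0, Math.min(instantLimit, instants.size()));
  }
}
```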
[jira] [Commented] (HUDI-6968) remove block logical in BulkInsertWriteFunction#open
[ https://issues.apache.org/jira/browse/HUDI-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779295#comment-17779295 ] Jing Zhang commented on HUDI-6968: -- Fixed via master branch: f05b5fc9db38e0bc4ccc2941cccf049991b67db2 > remove block logical in BulkInsertWriteFunction#open > > > Key: HUDI-6968 > URL: https://issues.apache.org/jira/browse/HUDI-6968 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jing Zhang >Priority: Trivial > > See more discussion in [PR9896|https://github.com/apache/hudi/pull/9896]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6968) remove block logical in BulkInsertWriteFunction#open
[ https://issues.apache.org/jira/browse/HUDI-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhang closed HUDI-6968. Fix Version/s: 1.0.0 Resolution: Fixed > remove block logical in BulkInsertWriteFunction#open > > > Key: HUDI-6968 > URL: https://issues.apache.org/jira/browse/HUDI-6968 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jing Zhang >Priority: Trivial > Fix For: 1.0.0 > > > See more discussion in [PR9896|https://github.com/apache/hudi/pull/9896]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
danny0405 commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1371010223 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java: ## @@ -63,21 +60,9 @@ public Comparator getComparator() { return comparator; } - @Override - public List orderAndFilter(HoodieWriteConfig writeConfig, - List operations, List pendingCompactionPlans) { -// Iterate through the operations and accept operations as long as we are within the configured target partitions -// limit -return operations.stream() - .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream() - .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction()) -.flatMap(e -> e.getValue().stream()).collect(Collectors.toList()); - } - @Override public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) { -return allPartitionPaths.stream().map(partition -> partition.replace("/", "-")) -.sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/")) +return allPartitionPaths.stream().sorted(comparator) .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(), Review Comment: Can you check the test failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
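For day-based partition paths, which contain only digits and a single delimiter, the `'/'` → `'-'` round-trip that this PR removes does not change the ordering: reverse lexicographic order is the same before and after the character swap. A sketch comparing the removed approach with a direct comparator (a plain `Comparator.reverseOrder()` stands in here for the strategy's own comparator):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Demonstrates why the '/' -> '-' round-trip removed by the PR above is
// redundant for yyyy/MM/dd partition paths: with one consistent delimiter
// and digit-only segments, both orderings agree.
public class DayBasedSortSketch {
  // Old approach from the removed code: swap, sort descending, swap back.
  static List<String> oldOrder(List<String> partitions) {
    return partitions.stream().map(p -> p.replace("/", "-"))
        .sorted(Comparator.reverseOrder())
        .map(p -> p.replace("-", "/"))
        .collect(Collectors.toList());
  }

  // New approach: sort the paths directly with a reverse-order comparator.
  static List<String> newOrder(List<String> partitions) {
    return partitions.stream()
        .sorted(Comparator.reverseOrder())
        .collect(Collectors.toList());
  }
}
```

Note the old form is also fragile: the trailing `replace("-", "/")` would corrupt any partition path that legitimately contained a dash, which is another reason sorting the paths directly is preferable.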
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
danny0405 commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1371009429 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: ## @@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext, val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf, ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean if (enableFileIndex && globPaths.isEmpty) { - // NOTE: There are currently 2 ways partition values could be fetched: - // - Source columns (producing the values used for physical partitioning) will be read - // from the data file - // - Values parsed from the actual partition path would be appended to the final dataset - // - //In the former case, we don't need to provide the partition-schema to the relation, - //therefore we simply stub it w/ empty schema and use full table-schema as the one being - //read from the data file. Review Comment: But after your change, the partition schema is always resolved from the partition path, which looks like a regression? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
danny0405 commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1368068891 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: ## @@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext, val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf, ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean if (enableFileIndex && globPaths.isEmpty) { - // NOTE: There are currently 2 ways partition values could be fetched: - // - Source columns (producing the values used for physical partitioning) will be read - // from the data file - // - Values parsed from the actual partition path would be appended to the final dataset - // - //In the former case, we don't need to provide the partition-schema to the relation, - //therefore we simply stub it w/ empty schema and use full table-schema as the one being - //read from the data file. Review Comment: Can you ensure that HUDI-4161 been solved after this change? Can you elaborate why `shouldExtractPartitionValuesFromPartitionPath` should be false after schema evolution? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778351985 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465) * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778352024 ## CI report: * c140ff462f58b649d45c782ce072b683cd908c1c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20441) * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
boneanxs commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778351293 @danny0405 Yea, sure, will raise the pr soon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
danny0405 commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778350430 @stream2000 @boneanxs Merging it first because it looks like a bug fix. Can you finalize it in follow-up PRs with more tests, or possibly the correct fix via `#abort`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]flink-sql write hudi use TIMESTAMP, when hive query, it get time+8h question, use TIMESTAMP_LTZ, the hive schema is bigint but timestamp [hudi]
danny0405 commented on issue #9864: URL: https://github.com/apache/hudi/issues/9864#issuecomment-1778351080 > but TIMESTAMP cannot be changed to long What do you mean by changed to long? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
boneanxs commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778346251 > we can confirm that datasource v2 won't waiting for all subtasks to be canceled before calling `org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper#abort` should be `org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort` instead of `org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper#abort` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778343885 ## CI report: * c140ff462f58b649d45c782ce072b683cd908c1c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20441) * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778343787 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465) * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6975] Optimize the implementation of DayBasedCompactionStrategy [hudi]
ksmou commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1370998924 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java: ## @@ -63,21 +60,9 @@ public Comparator getComparator() { return comparator; } - @Override - public List orderAndFilter(HoodieWriteConfig writeConfig, - List operations, List pendingCompactionPlans) { -// Iterate through the operations and accept operations as long as we are within the configured target partitions -// limit -return operations.stream() - .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream() - .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction()) -.flatMap(e -> e.getValue().stream()).collect(Collectors.toList()); - } - @Override public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) { -return allPartitionPaths.stream().map(partition -> partition.replace("/", "-")) -.sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/")) +return allPartitionPaths.stream().sorted(comparator) .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(), Review Comment: Yes. mainly remove the redundant orderAndFilter operation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
stream2000 commented on code in PR #9887: URL: https://github.com/apache/hudi/pull/9887#discussion_r1370998616 ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java: ## @@ -97,7 +97,6 @@ public void commit(List writeStatuses) { public void abort() { LOG.error("Commit " + instantTime + " aborted "); -writeClient.rollback(instantTime); Review Comment: Will add a test in the next PR.
[jira] [Closed] (HUDI-6959) Do not rollback current instant when bulk insert as row failed
[ https://issues.apache.org/jira/browse/HUDI-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6959. Resolution: Fixed Fixed via master branch: 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9 > Do not rollback current instant when bulk insert as row failed > -- > > Key: HUDI-6959 > URL: https://issues.apache.org/jira/browse/HUDI-6959 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: Qijun Fu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.14.1 > > > When org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort > is called, the subtasks may not all have been canceled yet. So if we > roll back the current instant immediately, new files may still be written > after the rollback is scheduled, which will cause dirty data. > > We should instead roll back the failed instant using the common eager and lazy rollback mechanisms.
[jira] [Updated] (HUDI-6959) Do not rollback current instant when bulk insert as row failed
[ https://issues.apache.org/jira/browse/HUDI-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6959: - Fix Version/s: 1.0.0 0.14.1 > Do not rollback current instant when bulk insert as row failed > -- > > Key: HUDI-6959 > URL: https://issues.apache.org/jira/browse/HUDI-6959 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: Qijun Fu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.14.1 > > > When org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort > is called, the subtasks may not all have been canceled yet. So if we > roll back the current instant immediately, new files may still be written > after the rollback is scheduled, which will cause dirty data. > > We should instead roll back the failed instant using the common eager and lazy rollback mechanisms.
[hudi] branch master updated: [HUDI-6959] Bulk insert as row do not rollback failed instant on abort (#9887)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 65dd645b487 [HUDI-6959] Bulk insert as row do not rollback failed instant on abort (#9887) 65dd645b487 is described below commit 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9 Author: StreamingFlames <18889897...@163.com> AuthorDate: Tue Oct 24 20:36:28 2023 -0500 [HUDI-6959] Bulk insert as row do not rollback failed instant on abort (#9887) --- .../java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java | 1 - .../src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala | 3 +-- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java index 4ad6c2066a3..58bb3e4d608 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java +++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java @@ -97,7 +97,6 @@ public class DataSourceInternalWriterHelper { public void abort() { LOG.error("Commit " + instantTime + " aborted "); -writeClient.rollback(instantTime); writeClient.close(); } diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala index 14bc84948c1..8cc107a24fb 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala @@ -1714,8 +1714,7 @@ class TestInsertTable 
extends HoodieSparkSqlTestBase { } } - // [HUDI-6900] TestInsertTable "Test Bulk Insert Into Consistent Hashing Bucket Index Table" is failing continuously - ignore("Test Bulk Insert Into Consistent Hashing Bucket Index Table") { + test("Test Bulk Insert Into Consistent Hashing Bucket Index Table") { withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert") { Seq("false", "true").foreach { bulkInsertAsRow => withTempDir { tmp =>
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
danny0405 merged PR #9887: URL: https://github.com/apache/hudi/pull/9887
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778336139 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 UNKNOWN
Re: [PR] [MINOR] Add tests on combine parallelism [hudi]
hudi-bot commented on PR #9731: URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778335971 ## CI report: * 047941b66ee52a99f626fd0dadb72581d9855385 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19966)
Re: [PR] [HUDI-6975] Optimize the implementation of DayBasedCompactionStrategy [hudi]
danny0405 commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1370996694 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java: ## @@ -63,21 +60,9 @@ public Comparator getComparator() { return comparator; } - @Override - public List orderAndFilter(HoodieWriteConfig writeConfig, - List operations, List pendingCompactionPlans) { -// Iterate through the operations and accept operations as long as we are within the configured target partitions -// limit -return operations.stream() - .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream() - .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction()) -.flatMap(e -> e.getValue().stream()).collect(Collectors.toList()); - } - @Override public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) { -return allPartitionPaths.stream().map(partition -> partition.replace("/", "-")) -.sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/")) +return allPartitionPaths.stream().sorted(comparator) .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(), Review Comment: Seems a pure code optimization?
[jira] [Closed] (HUDI-6929) Lazy loading dynamically for CompletionTimeQueryView
[ https://issues.apache.org/jira/browse/HUDI-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6929. Resolution: Fixed Fixed via master branch: bb8fc3e9f632a1fc3647fda63d482849355df2b7 > Lazy loading dynamically for CompletionTimeQueryView > > > Key: HUDI-6929 > URL: https://issues.apache.org/jira/browse/HUDI-6929 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-6962) Correct the behavior of bulk insert for NB-CC
[ https://issues.apache.org/jira/browse/HUDI-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6962: - Fix Version/s: 1.0.0 > Correct the behavior of bulk insert for NB-CC > -- > > Key: HUDI-6962 > URL: https://issues.apache.org/jira/browse/HUDI-6962 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > How do we handle the case where one of multiple concurrent writers is a job with a bulk insert > operation? > 1. Generated file group ID: generate a fixed file group ID, because other jobs > will use the fixed file group ID suffix instead of a random UUID suffix. The > behavior needs to be consistent to prevent later writer jobs from writing > records with the same primary key to different file groups. > 2. Deal with the transaction: the conflict resolution of bulk insert cannot > be deferred to the compaction phase. Because bulk insert writers flush data into > base files, if there are multiple bulk insert jobs, there may be > multiple base files in the same bucket.
[jira] [Closed] (HUDI-6962) Correct the behavior of bulk insert for NB-CC
[ https://issues.apache.org/jira/browse/HUDI-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6962. Resolution: Fixed Fixed via master branch: f05b5fc9db38e0bc4ccc2941cccf049991b67db2 > Correct the behavior of bulk insert for NB-CC > -- > > Key: HUDI-6962 > URL: https://issues.apache.org/jira/browse/HUDI-6962 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > How do we handle the case where one of multiple concurrent writers is a job with a bulk insert > operation? > 1. Generated file group ID: generate a fixed file group ID, because other jobs > will use the fixed file group ID suffix instead of a random UUID suffix. The > behavior needs to be consistent to prevent later writer jobs from writing > records with the same primary key to different file groups. > 2. Deal with the transaction: the conflict resolution of bulk insert cannot > be deferred to the compaction phase. Because bulk insert writers flush data into > base files, if there are multiple bulk insert jobs, there may be > multiple base files in the same bucket.
[hudi] branch master updated: [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC (#9896)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new f05b5fc9db3 [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC (#9896) f05b5fc9db3 is described below commit f05b5fc9db38e0bc4ccc2941cccf049991b67db2 Author: Jing Zhang AuthorDate: Wed Oct 25 09:29:13 2023 +0800 [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC (#9896) * Flink bulk_insert with fixed file group id suffix if NB-CC is enabled; * The bulk_insert writer should resolve conflicts with other writers under OCC strategies. --- .../apache/hudi/client/utils/TransactionUtils.java | 5 +- .../org/apache/hudi/config/HoodieWriteConfig.java | 11 + .../apache/hudi/client/HoodieFlinkWriteClient.java | 4 +- .../hudi/sink/StreamWriteOperatorCoordinator.java | 2 +- .../sink/bucket/BucketBulkInsertWriterHelper.java | 14 +- .../hudi/sink/bulk/BulkInsertWriteFunction.java| 15 +- .../java/org/apache/hudi/sink/utils/Pipelines.java | 3 +- .../hudi/sink/TestWriteMergeOnReadWithCompact.java | 116 +++ .../hudi/sink/utils/BulkInsertFunctionWrapper.java | 232 + .../org/apache/hudi/sink/utils/TestWriteBase.java | 25 +++ .../test/java/org/apache/hudi/utils/TestData.java | 5 +- .../org/apache/hudi/adapter/TestStreamConfigs.java | 32 +++ .../org/apache/hudi/adapter/TestStreamConfigs.java | 32 +++ .../org/apache/hudi/adapter/TestStreamConfigs.java | 32 +++ .../org/apache/hudi/adapter/TestStreamConfigs.java | 35 .../org/apache/hudi/adapter/TestStreamConfigs.java | 35 16 files changed, 581 insertions(+), 17 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java index 15f6be8f79a..1bea51721c8 100644 --- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java @@ -21,6 +21,7 @@ package org.apache.hudi.client.utils; import org.apache.hudi.client.transaction.ConcurrentOperation; import org.apache.hudi.client.transaction.ConflictResolutionStrategy; import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.WriteOperationType; import org.apache.hudi.common.table.HoodieTableMetaClient; import org.apache.hudi.common.table.timeline.HoodieInstant; import org.apache.hudi.common.table.timeline.HoodieTimeline; @@ -67,8 +68,8 @@ public class TransactionUtils { Option lastCompletedTxnOwnerInstant, boolean reloadActiveTimeline, Set pendingInstants) throws HoodieWriteConflictException { -// Skip to resolve conflict if using non-blocking concurrency control -if (config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() && !config.isNonBlockingConcurrencyControl()) { +WriteOperationType operationType = thisCommitMetadata.map(HoodieCommitMetadata::getOperationType).orElse(null); +if (config.needResolveWriteConflict(operationType)) { // deal with pendingInstants Stream completedInstantsDuringCurrentWriteOperation = getCompletedInstantsDuringCurrentWriteOperation(table.getMetaClient(), pendingInstants); diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java index c9e9b94b1a9..8c08beaaef9 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java @@ -46,6 +46,7 @@ import org.apache.hudi.common.model.HoodieTableType; import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload; import 
org.apache.hudi.common.model.RecordPayloadType; import org.apache.hudi.common.model.WriteConcurrencyMode; +import org.apache.hudi.common.model.WriteOperationType; import org.apache.hudi.common.table.HoodieTableConfig; import org.apache.hudi.common.table.log.block.HoodieLogBlock; import org.apache.hudi.common.table.marker.MarkerType; @@ -2616,6 +2617,16 @@ public class HoodieWriteConfig extends HoodieConfig { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } + public boolean needResolveWriteConflict(WriteOperationType operationType) { +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + // NB-CC don't need to resolve write conflict except bulk insert operation + return WriteOperationType.BULK_INSERT == operationType || !isNonBlockingConcurrencyControl(); +}
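The core of the commit is the new `needResolveWriteConflict` guard in the hunk above: under OCC, write conflicts are always resolved; under NB-CC, only bulk insert still needs eager conflict resolution, since it writes base files directly. That decision logic can be mirrored in a tiny standalone predicate; the booleans below stand in for the `HoodieWriteConfig` lookups, so this is a sketch rather than the real API:

```java
class ConflictCheckSketch {
    enum WriteOperationType { BULK_INSERT, UPSERT, INSERT }

    // Hypothetical standalone version of HoodieWriteConfig#needResolveWriteConflict.
    // supportsOcc  : WriteConcurrencyMode supports optimistic concurrency control
    // nonBlockingCc: non-blocking concurrency control (NB-CC) is enabled
    static boolean needResolveWriteConflict(WriteOperationType op,
                                            boolean supportsOcc,
                                            boolean nonBlockingCc) {
        if (supportsOcc) {
            // NB-CC defers conflict handling to compaction, except for bulk insert.
            return op == WriteOperationType.BULK_INSERT || !nonBlockingCc;
        }
        return false;
    }

    public static void main(String[] args) {
        // NB-CC enabled: only bulk insert resolves conflicts eagerly.
        System.out.println(needResolveWriteConflict(WriteOperationType.BULK_INSERT, true, true)); // prints true
        System.out.println(needResolveWriteConflict(WriteOperationType.UPSERT, true, true));      // prints false
    }
}
```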
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 merged PR #9896: URL: https://github.com/apache/hudi/pull/9896
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1778327433 The failed test is known to be flaky: `TestHoodieLogFormat.testAvroLogRecordReaderWithMixedInsertsCorruptsRollbackAndMergedLogBlock` : https://pipelinesghubeus23.actions.githubusercontent.com/2uhBcZr3qV5ap2vibMf4tU0bjg49uuN9wlovCTzCjH6fMLAme0/_apis/pipelines/1/runs/42270/signedlogcontent/13?urlExpires=2023-10-25T01%3A24%3A24.0091887Z&urlSigningMethod=HMACV1&urlSignature=5EgWFpWhEswB%2FySzG2hp2q99FnPaNFTCC3zvozWazEM%3D
Re: [I] [SUPPORT] After Flink writes to Hudi and syncs to Hive, why is the timestamp field of bigint type, and how can the field synced to Hive keep the timestamp type [hudi]
linrongjun-l commented on issue #9766: URL: https://github.com/apache/hudi/issues/9766#issuecomment-1778312506 > > Before release 0.14.0, there is a sync param `hive_sync.support_timestamp`; when enabled, the `Timestamp(6)` type would be synced as `TIMESTAMP` in hive. Since release 0.14.0, all timestamp types are synced as `TIMESTAMP`. > > Thanks for your reply. With `hive_sync.support_timestamp` enabled, the field type in Hive is indeed TIMESTAMP, but when I select the value in Hive there is an error: Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable I ran into the same problem. How did you eventually solve it?
Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]
hudi-bot commented on PR #9914: URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778296527 ## CI report: * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20462)
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778296319 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
Re: [PR] [MINOR] Add tests on combine parallelism [hudi]
hudi-bot commented on PR #9731: URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778296103 ## CI report: * 047941b66ee52a99f626fd0dadb72581d9855385 UNKNOWN
Re: [PR] [MINOR] Add tests on combine parallelism [hudi]
yihua commented on PR #9731: URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778294686 CI is green. https://github.com/apache/hudi/assets/2497195/b14e4414-fbb5-4f1b-a3e0-5a2d8335775d
Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]
hudi-bot commented on PR #9914: URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778289302 ## CI report: * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 UNKNOWN
[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test
[ https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6978: Description: For the test TestMergeIntoTable2@"Test only insert for source table in dup key without preCombineField" was: For the test "Test only insert for source table in dup key without preCombineField" @"Test only insert for source table in dup key without preCombineField" > Fix TestMergeIntoTable2 test > > > Key: HUDI-6978 > URL: https://issues.apache.org/jira/browse/HUDI-6978 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > > For the test > TestMergeIntoTable2@"Test only insert for source table in dup key without > preCombineField"
[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test
[ https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6978: Description: For the test "Test only insert for source table in dup key without preCombineField" @"Test only insert for source table in dup key without preCombineField" was:For @"Test only insert for source table in dup key without preCombineField" > Fix TestMergeIntoTable2 test > > > Key: HUDI-6978 > URL: https://issues.apache.org/jira/browse/HUDI-6978 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > > For the test > "Test only insert for source table in dup key without preCombineField" > @"Test only insert for source table in dup key without preCombineField"
[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test
[ https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6978: Description: For the test TestMergeIntoTable2@"Test only insert for source table in dup key without preCombineField", after adding " spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")", the test fails: {code:java} Expected Array([1,a2,10.4,1004,2021-03-21], [1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21]), but got Array([1,a2,10.2,1002,2021-03-21], [1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21]) ScalaTestFailureLocation: org.apache.spark.sql.hudi.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:109) org.scalatest.exceptions.TestFailedException: Expected Array([1,a2,10.4,1004,2021-03-21], [1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21]), but got Array([1,a2,10.2,1002,2021-03-21], [1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21]) at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562) at org.scalatest.Assertions.assertResult(Assertions.scala:867) at org.scalatest.Assertions.assertResult$(Assertions.scala:863) at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1562) at org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.checkAnswer(HoodieSparkSqlTestBase.scala:109) at org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$36(TestMergeIntoTable2.scala:897) at org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$36$adapted(TestMergeIntoTable2.scala:841) at org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.withTempDir(HoodieSparkSqlTestBase.scala:77) at org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$35(TestMergeIntoTable2.scala:841) at org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$35$adapted(TestMergeIntoTable2.scala:840) at 
scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$34(TestMergeIntoTable2.scala:840) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.$anonfun$test$1(HoodieSparkSqlTestBase.scala:85) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189) at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1562) at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187) at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199) at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181) at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1562) at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232) at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232) at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231) at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562) at 
org.scalatest.Suite.run(Suite.scala:1112) at org.scalatest.Suite.run$(Suite.scala:1094) at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562) at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236) at org.scalatest.SuperEngine.runImpl(Engine.scala:535) at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236) at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235) at org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.org$scalatest$BeforeAndAfterAll$$super$run(HoodieSparkSqlTestBase.scala:44) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll.run(Bef
[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test
[ https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6978: Description: For @"Test only insert for source table in dup key without preCombineField" > Fix TestMergeIntoTable2 test > > > Key: HUDI-6978 > URL: https://issues.apache.org/jira/browse/HUDI-6978 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > > For @"Test only insert for source table in dup key without preCombineField"
[jira] [Created] (HUDI-6978) Fix TestMergeIntoTable2 test
Ethan Guo created HUDI-6978: --- Summary: Fix TestMergeIntoTable2 test Key: HUDI-6978 URL: https://issues.apache.org/jira/browse/HUDI-6978 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
fenil25 opened a new issue, #9915: URL: https://github.com/apache/hudi/issues/9915 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** I want to bootstrap a table into Hudi. Size of the table is around 12 TB. The base path of the source table is in S3. Its a partitioned hive table and the average parquet file size is 2.5Gb. I used the FULL_RECORD bootstrap mode using Spark for bootstrapping and it was successful. However, the average file size of hudi table was around 120 Mb which aligns with the default which ended up creating 100K+ files. I am using S3 storage as the DFS. This made the read performance quite slow. I am not using any table partitioning yet. I did set `hoodie.parquet.max.file.size": 1258291200,` (~1.2Gb) but this configuration was completely ignored. FAQs and File Sizing docs mainly talk about ways to adjust the file size while streaming data into Hudi. How can I control the file size during the bootstrapping process itself? I also read in the docs that - ``` A full record bootstrap is functionally equivalent to a bulk-insert. ``` Does that mean both are essentially the same. Is there any advantage of using one over the another? (Note: _METADATA_ONLY does not work for our use-case_) **Environment Description** Running it via EMR * Hudi version : 13.0 * Spark version : 3.3 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
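As a quick sanity check on the numbers in the report above, the file counts implied by each target file size can be worked out directly (pure arithmetic; the 12 TB table size is taken from the issue, and compression-ratio differences between the source and Hudi files are ignored):

```python
# Estimate how many files a 12 TB table yields at a given target file size.
TB = 1024 ** 4
MB = 1024 ** 2

table_size_bytes = 12 * TB

def estimated_file_count(target_file_size_bytes: int) -> int:
    """Ceiling division: files needed to hold the table at the target size."""
    return -(-table_size_bytes // target_file_size_bytes)

# ~120 MB files (the default-sized output the reporter observed) vs the
# requested 1258291200-byte (~1.2 GB) files.
print(estimated_file_count(120 * MB))        # ≈ 105K files, matching "100K+"
print(estimated_file_count(1_258_291_200))   # ≈ 10.5K files, roughly 10x fewer
```

If the larger target were honored, the same data would land in roughly a tenth as many files, which is why the ignored `hoodie.parquet.max.file.size` matters so much for S3 read performance here. Since a FULL_RECORD bootstrap is described in the docs as equivalent to a bulk insert, the effective file size may also be driven by write parallelism rather than the max-file-size setting alone; that is worth checking against the file sizing docs for the Hudi version in use.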
Re: [PR] Test ci [hudi]
kkalanda-score closed pull request #9095: Test ci URL: https://github.com/apache/hudi/pull/9095 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6551] A new slashed month partition value extractor [hudi]
yihua closed pull request #9184: [HUDI-6551] A new slashed month partition value extractor URL: https://github.com/apache/hudi/pull/9184 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1778089960 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]
hudi-bot commented on PR #9914: URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778090503 ## CI report: * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20462) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6551] A new slashed month partition value extractor [hudi]
yihua commented on PR #9184: URL: https://github.com/apache/hudi/pull/9184#issuecomment-1778090063 Closing this PR now. @banank1989 feel free to reopen it if you need additional functionality. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Test ci [hudi]
yihua commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1778088836 @kkalanda-score do you still need this PR? If not, the PR should be closed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6898] Medatawriter closing in tests, update logging [hudi]
yihua merged PR #9768: URL: https://github.com/apache/hudi/pull/9768 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778076328 I discussed the comments with @danny0405 offline. Two things to address in this PR: (1) Instead of putting both the partial and full schemas in the log block header, when partial updates are enabled, only the partial schema is added to the log block header, in the same `SCHEMA` header, and the full schema for snapshot reads is always passed in from the table schema. To indicate that the schema is partial, a new log block header `IS_PARTIAL` should be added. (2) We should let users specify in the MERGE INTO statement whether they want partial updates in the log files of MOR tables, e.g., using something like `col = EXISTING` to indicate that the column values should be kept as is. We may not support this in this PR; instead, we should have an interim write config to control this behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
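Point (1) above can be sketched with a toy reader-side model (illustrative only: the dict-based header and function below are not the actual `HoodieLogBlock` API; only the `SCHEMA` and `IS_PARTIAL` names come from the comment, everything else is an assumption):

```python
# Toy model: a data block's header carries one schema under SCHEMA plus an
# IS_PARTIAL flag; the full schema for snapshot reads is always supplied
# from the table, never from the block header.
def resolve_block_read_schemas(block_header: dict, table_schema: list) -> tuple:
    """Return (decode_schema, merge_target_schema) for one log block."""
    block_schema = block_header["SCHEMA"]
    if block_header.get("IS_PARTIAL", False):
        # Partial block: decode with the partial schema, then merge the
        # decoded columns into the full table schema during the snapshot read.
        return block_schema, table_schema
    # Full block: the block schema is the complete record schema.
    return block_schema, block_schema

full = ["_row_key", "price", "qty", "ts"]
partial_header = {"SCHEMA": ["_row_key", "price"], "IS_PARTIAL": True}
decode, merge_into = resolve_block_read_schemas(partial_header, full)
print(decode)      # ['_row_key', 'price']
print(merge_into)  # ['_row_key', 'price', 'qty', 'ts']
```

The point of the flag is that a reader can distinguish "this block intentionally has fewer columns" from "this block's schema is the record schema" without shipping two schemas per block.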
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1370828538 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: Agree that option 1 is the most natural handling. In the current schema evolution on write, the write schema is evolved based on the input, and the evolved schema is written to the commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6836] Shutting down deltastreamer in tests and shutting down metrics for write client [hudi]
yihua commented on PR #9667: URL: https://github.com/apache/hudi/pull/9667#issuecomment-1778007738 @pratyakshsharma are you good with the changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]
yihua commented on code in PR #9752: URL: https://github.com/apache/hudi/pull/9752#discussion_r1370788996 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java: ## @@ -115,6 +114,35 @@ public byte[] getContentBytes() throws IOException { return serializeRecords(records.get()); } + private Schema getReaderSchema(Option<Schema> readerSchemaOpt) { +Schema writerSchema = getWriterSchema(super.getLogBlockHeader()); +// If no reader schema has been provided, assume the writer schema as the reader schema +if (!readerSchemaOpt.isPresent()) { + return writerSchema; +} + +// Handle table renames when there are still log files +Schema readerSchema = readerSchemaOpt.get(); +if (isHandleDifferingNamespaceRequired(readerSchema, writerSchema)) { + return writerSchema; +} else { + return readerSchema; +} + } + + /** + * Spark 3.1 uses Avro 1.8.2, which matches fields by their fully qualified name. If namespaces differ, reads will fail for fields that have the same name and type but differing name(spaces). + * Such cases can arise when an ALTER-TABLE-RENAME DDL is performed. + * + * @param readerSchema the reader schema + * @param writerSchema the writer schema + * @return whether handling of differing namespaces between the reader and writer schema is required + */ + private static boolean isHandleDifferingNamespaceRequired(Schema readerSchema, Schema writerSchema) { +return readerSchema.getClass().getPackage().getImplementationVersion().compareTo("1.8.2") <= 0 +&& !readerSchema.getName().equals(writerSchema.getName()); + } Review Comment: The fix works. But I'm thinking that at this layer of reading log files, such details should not be exposed. It's better to fix the schema generation in the ALTER-TABLE-RENAME DDL to produce consistent namespaces, or resolve the schema's namespace in upper layers, e.g., `TableSchemaResolver`. Also, the schema namespace should not change across Hudi commits. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
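One detail that may be worth double-checking in the `isHandleDifferingNamespaceRequired` snippet above: `compareTo` on version strings is lexicographic, not numeric, so the `<= "1.8.2"` guard can misclassify multi-digit Avro versions such as 1.10.x. A quick illustration (shown in Python; Java's `String.compareTo` orders strings the same way):

```python
# Version strings compare lexicographically: '1' < '8' decides the result
# before the two-digit "10" is ever considered, so "1.10.x" sorts *before*
# "1.8.2" even though 1.10 is the newer release.
assert "1.10.2" < "1.8.2"  # true for strings, wrong as a version ordering

def parse_version(v: str) -> tuple:
    """Numeric comparison: split on dots and compare component-wise."""
    return tuple(int(part) for part in v.split("."))

assert parse_version("1.10.2") > parse_version("1.8.2")  # correct ordering
```

Whether this matters in practice depends on which Avro versions the check can actually encounter at runtime, but a component-wise comparison is the safer pattern.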
Re: [PR] [HUDI-6898] Medatawriter closing in tests, update logging [hudi]
hudi-bot commented on PR #9768: URL: https://github.com/apache/hudi/pull/9768#issuecomment-1778003218 ## CI report: * 55beb62d168b2c9b9d99f0c3765637d441f58b5f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20458) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]
yihua commented on PR #9752: URL: https://github.com/apache/hudi/pull/9752#issuecomment-1777997252 > Seems we have a plan to migrate to Avro above 1.8.2, right? cc @yihua ~ The Avro dependency version is tied to the Spark version, and Avro 1.8.2 is tied to Spark 3.1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6895][WIP] Change default timeline timezone from local to UTC [hudi]
yihua commented on PR #9794: URL: https://github.com/apache/hudi/pull/9794#issuecomment-1777989084 @codope do we still plan to land this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
yihua commented on code in PR #9887: URL: https://github.com/apache/hudi/pull/9887#discussion_r1370778292 ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java: ## @@ -97,7 +97,6 @@ public void commit(List writeStatuses) { public void abort() { LOG.error("Commit " + instantTime + " aborted "); -writeClient.rollback(instantTime); Review Comment: The fix makes sense based on the information provided. @stream2000 could you add a test to verify that after a bulk insert with DS v2 fails, the commit is left inflight, and a subsequent new writer / transaction rolls back the commit? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
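The lifecycle the requested test should pin down can be sketched as a toy model (pure illustration; `ToyTimeline`, `start_commit`, and `abort` are invented names, not Hudi APIs):

```python
# Toy timeline model of the new abort semantics: abort leaves the failed
# instant inflight, and the *next* writer performs the rollback lazily.
class ToyTimeline:
    def __init__(self):
        self.instants = {}  # instant_time -> "inflight" | "rolledback"

    def start_commit(self, instant_time: str) -> None:
        # Lazy cleanup: roll back any instant a failed writer left inflight.
        for prev, state in list(self.instants.items()):
            if state == "inflight":
                self.instants[prev] = "rolledback"
        self.instants[instant_time] = "inflight"

    def abort(self, instant_time: str) -> None:
        # Per this PR: no eager writeClient.rollback(); leave it inflight.
        pass

tl = ToyTimeline()
tl.start_commit("001")
tl.abort("001")                        # failed DS v2 bulk insert
assert tl.instants["001"] == "inflight"
tl.start_commit("002")                 # subsequent writer cleans up
assert tl.instants["001"] == "rolledback"
```

An integration test against the real writer would assert the same two observations on the actual Hudi timeline: the instant stays inflight after the abort, and is rolled back once a new transaction begins.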
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
yihua commented on code in PR #9888: URL: https://github.com/apache/hudi/pull/9888#discussion_r1370761534 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala: ## @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi + +import org.apache.hudi.common.model.HoodieFileGroupId +import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.util.{ArrayData, MapData} +import org.apache.spark.sql.types.{DataType, Decimal} +import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} + +import java.util + +case class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow, + fileGroups: Map[HoodieFileGroupId, List[HoodieCDCFileSplit]] + ) extends InternalRow { + + def getFileSplitsFor(fileGroupId: HoodieFileGroupId): Option[List[HoodieCDCFileSplit]] = { +fileGroups.get(fileGroupId) + } + + override def numFields: Int = { +partitionValues.numFields + } + + override def setNullAt(i: Int): Unit = { +partitionValues.setNullAt(i) + } + + override def update(i: Int, value: Any): Unit = { +partitionValues.update(i, value) + } + + override def copy(): InternalRow = { +HoodiePartitionCDCFileGroupMapping(partitionValues.copy(), fileGroups) + } + + override def isNullAt(ordinal: Int): Boolean = { +partitionValues.isNullAt(ordinal) + } + + override def getBoolean(ordinal: Int): Boolean = { +partitionValues.getBoolean(ordinal) + } + + override def getByte(ordinal: Int): Byte = { +partitionValues.getByte(ordinal) + } + + override def getShort(ordinal: Int): Short = { +partitionValues.getShort(ordinal) + } + + override def getInt(ordinal: Int): Int = { +partitionValues.getInt(ordinal) + } + + override def getLong(ordinal: Int): Long = { +partitionValues.getLong(ordinal) + } + + override def getFloat(ordinal: Int): Float = { +partitionValues.getFloat(ordinal) + } + + override def getDouble(ordinal: Int): Double = { +partitionValues.getDouble(ordinal) + } + + override def getDecimal(ordinal: Int, precision: Int, scale: Int): Decimal = { +partitionValues.getDecimal(ordinal, precision, scale) + } + + override def getUTF8String(ordinal: Int): UTF8String = { 
+partitionValues.getUTF8String(ordinal) + } + + override def getBinary(ordinal: Int): Array[Byte] = { +partitionValues.getBinary(ordinal) + } + + override def getInterval(ordinal: Int): CalendarInterval = { +partitionValues.getInterval(ordinal) + } + + override def getStruct(ordinal: Int, numFields: Int): InternalRow = { +partitionValues.getStruct(ordinal, numFields) + } + + override def getArray(ordinal: Int): ArrayData = { +partitionValues.getArray(ordinal) + } + + override def getMap(ordinal: Int): MapData = { +partitionValues.getMap(ordinal) + } + + override def get(ordinal: Int, dataType: DataType): AnyRef = { +partitionValues.getMap(ordinal) Review Comment: this should be `partitionValues.get(ordinal, dataType)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
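As an aside on the bug flagged above (`get` forwarding to `getMap`): hand-written per-method delegation of a wide interface like `InternalRow` invites exactly this copy-paste mistake. In languages with dynamic dispatch the forwarding can be made generic; a minimal Python sketch of the idea (illustrative only, not Spark code):

```python
# A generic forwarder: anything not defined on the wrapper is looked up on
# the wrapped object, so no per-method body can be mis-pasted.
class DelegatingRow:
    def __init__(self, inner, file_groups):
        self._inner = inner
        self.file_groups = file_groups  # the one piece of extra state

    def __getattr__(self, name):
        # Called only for attributes not found on DelegatingRow itself.
        return getattr(self._inner, name)

class FakeRow:
    def get(self, ordinal, data_type):
        return ("get", ordinal, data_type)

    def get_map(self, ordinal):
        return ("map", ordinal)

row = DelegatingRow(FakeRow(), file_groups={})
print(row.get(0, "int"))  # ('get', 0, 'int') — no chance of hitting get_map
```

In the Scala code itself the practical fix is the reviewer's one-liner; the sketch just shows why a long run of near-identical overrides deserves extra scrutiny in review.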
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
yihua commented on code in PR #9888: URL: https://github.com/apache/hudi/pull/9888#discussion_r1370759016 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -141,12 +145,37 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, case _ => baseFileReader(file) } } +// CDC queries. +case hoodiePartitionCDCFileGroupSliceMapping: HoodiePartitionCDCFileGroupMapping => + val filePath: Path = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file) + val fileGroupId: HoodieFileGroupId = new HoodieFileGroupId(filePath.getParent.toString, filePath.getName) + val fileSplits = hoodiePartitionCDCFileGroupSliceMapping.getFileSplitsFor(fileGroupId).get.toArray + val fileGroupSplit: HoodieCDCFileGroupSplit = HoodieCDCFileGroupSplit(fileSplits) + buildCDCRecordIterator(fileGroupSplit, preMergeBaseFileReader, hadoopConf, requiredSchema, props) // TODO: Use FileGroupReader here: HUDI-6942. case _ => baseFileReader(file) } } } + protected def buildCDCRecordIterator(cdcFileGroupSplit: HoodieCDCFileGroupSplit, + preMergeBaseFileReader: PartitionedFile => Iterator[InternalRow], + hadoopConf: Configuration, + requiredSchema: StructType, + props: TypedProperties): Iterator[InternalRow] = { +val metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, tableState.tablePath, props) +val cdcSchema = CDCRelation.FULL_CDC_SPARK_SCHEMA +new CDCFileGroupIterator( Review Comment: This does not seem to leverage `HoodieFileGroupReader`. 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala: ## @@ -119,6 +120,35 @@ case class MergeOnReadIncrementalRelation(override val sqlContext: SQLContext, } } + def listFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Map[InternalRow, Seq[FileSlice]] = { Review Comment: Could this be extracted out as a util method instead of sitting inside the MOR incremental relation, which will not be used by the new Hudi parquet file format class? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCFileGroupIterator.scala: ## @@ -0,0 +1,558 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.cdc + +import org.apache.avro.Schema +import org.apache.avro.generic.{GenericData, GenericRecord, IndexedRecord} +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hudi.HoodieBaseRelation.BaseFileReader +import org.apache.hudi.HoodieConversionUtils.toScalaOption +import org.apache.hudi.HoodieDataSourceHelper.AvroDeserializerSupport +import org.apache.hudi.avro.HoodieAvroUtils +import org.apache.hudi.{AvroConversionUtils, AvroProjection, HoodieFileIndex, HoodieMergeOnReadFileSplit, HoodieTableSchema, HoodieTableState, LogFileIterator, RecordMergingFileIterator, SparkAdapterSupport} +import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties} +import org.apache.hudi.common.model.{FileSlice, HoodieAvroRecordMerger, HoodieLogFile, HoodieRecord, HoodieRecordMerger, HoodieRecordPayload} +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.hudi.common.table.cdc.{HoodieCDCFileSplit, HoodieCDCUtils} +import org.apache.hudi.common.table.cdc.HoodieCDCInferenceCase._ +import org.apache.hudi.common.table.log.HoodieCDCLogRecordIterator +import org.apache.hudi.common.table.cdc.HoodieCDCOperation._ +import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode._ +import org.apache.hudi.common.util.ValidationUtils.checkState +import org.apache.hudi.config.HoodiePayloadConfig +import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory +import org.apache.spark.sql.HoodieCatalystEx
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1777922386 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 0fe4d74eb04601d878a44c6d8892168e1e321d1a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20405) * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org