Re: [I] ClassNotFoundException: MergeOnReadInputSplit [hudi]

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #9474:
URL: https://github.com/apache/hudi/issues/9474#issuecomment-1778613662

   @jiangzzwy I tried a similar command and it worked for me. It looks like some problem in your setup. Did you add the jar under $FLINK_HOME/lib? Let us know if you still face this issue. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9912:
URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778609048

   
   ## CI report:
   
   * aadc5fbc31b83cfff275fee66618071b0bc9e76d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9910:
URL: https://github.com/apache/hudi/pull/9910#issuecomment-1778608974

   
   ## CI report:
   
   * f158692bc1611582566b3bbd76e49d07a290e802 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20447)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


ksmou commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1371222617


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator<String> getComparator() {
     return comparator;
   }
 
-  @Override
-  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
-      List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions
-    // limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment:
   If the original size `allPartitionPaths.size()` is smaller than `writeConfig.getTargetPartitionsPerDayBasedCompaction()`, then `subList(0, writeConfig.getTargetPartitionsPerDayBasedCompaction())` without the `Math.min` guard would throw an IndexOutOfBoundsException. I think we can use `limit` to replace `subList()`.
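
   A minimal, self-contained sketch of the `limit`-based alternative suggested here; the comparator and the target count stand in for the strategy's `comparator` and `writeConfig.getTargetPartitionsPerDayBasedCompaction()`:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LimitVsSubListSketch {
  public static void main(String[] args) {
    List<String> allPartitionPaths = List.of("2023/10/22", "2023/10/23", "2023/10/24");
    Comparator<String> comparator = Comparator.reverseOrder();
    int targetPartitions = 10; // deliberately larger than the list size

    // limit() simply stops when the stream runs out, so no bounds check is needed
    List<String> filtered = allPartitionPaths.stream()
        .sorted(comparator)
        .limit(targetPartitions)
        .collect(Collectors.toList());
    System.out.println(filtered); // prints all three partitions, newest first
  }
}
```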






Re: [I] [SUPPORT] Parquet files got cleaned up even when cleaning operation failed hence leading to subsequent failed clustering and cleaning [hudi]

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #9257:
URL: https://github.com/apache/hudi/issues/9257#issuecomment-1778596706

   @adityaverma1997 Sorry for all the delays here. I tried to reproduce this a couple of times but never got any error, and I also tried to mock up some failures while cleaning happens. It really depends on when exactly the cleaning fails. Are you able to reproduce this consistently?





Re: [I] [SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions [hudi]

2023-10-24 Thread via GitHub


codope closed issue #9806: [SUPPORT] AWS Glue Sync fails on a Hudi table with > 
25 partitions
URL: https://github.com/apache/hudi/issues/9806





Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #8614:
URL: https://github.com/apache/hudi/issues/8614#issuecomment-1778570591

   @danny0405 I think the issue is `org.apache.hudi:hudi-utilities-bundle_2.12:0.13.1`. The utilities bundle jar can't carry dependencies specific to each Spark version, so don't use the Maven one; either build your own jar and use that, or use the slim-bundle package. We should not use both utilities-bundle and spark-bundle together, since utilities-bundle already includes the spark-bundle dependency. So ideally use the utilities slim bundle.
   
   @pushpavanthar I did ask you to try the same on this Slack thread - https://apache-hudi.slack.com/archives/C4D716NPQ/p1697802409713149. Were you able to try it out?





[hudi] branch master updated (051eb0e930e -> 98d956fd845)

2023-10-24 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 051eb0e930e [MINOR] Add tests on combine parallelism (#9731)
 add 98d956fd845 [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 
(#9914)

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]

2023-10-24 Thread via GitHub


yihua merged PR #9914:
URL: https://github.com/apache/hudi/pull/9914





Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9910:
URL: https://github.com/apache/hudi/pull/9910#issuecomment-1778561669

   
   ## CI report:
   
   * f158692bc1611582566b3bbd76e49d07a290e802 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Compaction error [hudi]

2023-10-24 Thread via GitHub


codope closed issue #9885: [SUPPORT] Compaction error 
URL: https://github.com/apache/hudi/issues/9885





Re: [I] [SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions [hudi]

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #9806:
URL: https://github.com/apache/hudi/issues/9806#issuecomment-1778552572

   @buiducsinh34 @noahtaite Closing this out as the PR is merged. Thanks, everybody. Feel free to reopen if you still see the issue.





Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p

2023-10-24 Thread via GitHub


pushpavanthar commented on issue #8614:
URL: https://github.com/apache/hudi/issues/8614#issuecomment-1778547455

   We tried running this on emr-6.7.0 and a few other higher release labels.





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371173652


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   }
 
   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment:
   Yes, I'd like to make sure that my changes do not break MERGE INTO on MOR 
tables. 






Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371170669


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   }
 
   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment:
   Got it. Is it related to this change?






Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371157540


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   }
 
   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment:
   This is to ensure that log files are written for the MOR table. Otherwise, the MOR table generated by the test may not contain log files, making it no different from COW.






Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-24 Thread via GitHub


ad1happy2go commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1778505679

   @fenil25 The bulk-insert operation doesn't do small-file handling, which is why you see file sizes equal to the split size. So the total number of partitions is calculated as `number_of_files * number_of_blocks_in_file`. 
   - One way to handle this case is to run clustering with the proper configuration to achieve correctly sized files. 
   - The other way is to set the Spark configuration `spark.sql.files.maxPartitionBytes` (which defaults to 128 MB in Spark) while doing the bulk-insert.
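
   A minimal sketch of the second workaround, assuming a local Spark session; the app name and the 512 MB value are illustrative only:

```java
import org.apache.spark.sql.SparkSession;

public class BulkInsertSplitSizeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bulk-insert-split-size")
        .master("local[*]")
        .getOrCreate();

    // Raise the split size from the 128 MB default to 512 MB so that each
    // input split, and hence each bulk-inserted file, is larger.
    spark.conf().set("spark.sql.files.maxPartitionBytes", String.valueOf(512L * 1024 * 1024));
  }
}
```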





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778503193

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1371143452


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -65,8 +65,11 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
   // For more details please check HUDI-4161
   // NOTE: This override has to mirror semantic of whenever this Relation is converted into [[HadoopFsRelation]],
   //       which is currently done for all cases, except when Schema Evolution is enabled
-  override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean =
-    internalSchemaOpt.isEmpty
+  override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = {
+    if (hasSchemaOnRead) {
+      super.needExtractPartitionValuesFromPartitionPath()
+    } else true

Review Comment:
   What exactly is the behavior change in line 205? Can you elaborate a little more?






Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1371142864


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -220,7 +220,9 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
    *       partition path, meaning that string value of "2022/01/01" will be appended, and not its original
    *       representation
    */
-  protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = {
+  protected val shouldExtractPartitionValuesFromPartitionPath: Boolean = needExtractPartitionValuesFromPartitionPath()
+
+  protected def needExtractPartitionValuesFromPartitionPath(): Boolean = {
     // Controls whether partition columns (which are the source for the partition path values) should

Review Comment:
   Why add a new method name?






Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1371141863


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
     val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf,
       ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
     if (enableFileIndex && globPaths.isEmpty) {
-      // NOTE: There are currently 2 ways partition values could be fetched:
-      //          - Source columns (producing the values used for physical partitioning) will be read
-      //            from the data file
-      //          - Values parsed from the actual partition path would be appended to the final dataset
-      //
-      //       In the former case, we don't need to provide the partition-schema to the relation,
-      //       therefore we simply stub it w/ empty schema and use full table-schema as the one being
-      //       read from the data file.

Review Comment:
   Can you add this detailed info as a comment there?






Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1371139632


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator<String> getComparator() {
     return comparator;
   }
 
-  @Override
-  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
-      List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions
-    // limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment:
   Why do we subList with its original size? I'm confused.






Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9912:
URL: https://github.com/apache/hudi/pull/9912#discussion_r1371137154


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:
##
@@ -226,9 +226,9 @@ public void monitorDirAndForwardSplits(SourceContext<MergeOnReadInputSplit> context) {
         this.issuedOffset = result.getOffset();
         LOG.info("\n"
             + "\n"
-            + "-- consumed to instant: {}\n"
+            + "-- consumed {} to instant: {}\n"
             + "",
-            this.issuedInstant);
+            conf.getString(FlinkOptions.TABLE_NAME), this.issuedInstant);

Review Comment:
   I would like it to be:
   
   -- table: xxx
   -- consumed to instant: xxx
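
   A small self-contained sketch of what that layout could look like, assuming an slf4j logger; `tableName` and `issuedInstant` stand in for `conf.getString(FlinkOptions.TABLE_NAME)` and `this.issuedInstant`:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StreamReadLogSketch {
  private static final Logger LOG = LoggerFactory.getLogger(StreamReadLogSketch.class);

  // Logs the table name and the consumed instant on separate lines,
  // matching the suggested layout.
  static void logConsumedInstant(String tableName, String issuedInstant) {
    LOG.info("\n"
        + "------------------------------------------------\n"
        + "-- table: {}\n"
        + "-- consumed to instant: {}\n"
        + "------------------------------------------------",
        tableName, issuedInstant);
  }
}
```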






Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9912:
URL: https://github.com/apache/hudi/pull/9912#discussion_r1371135529


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -57,6 +59,15 @@ public String getEndInstant() {
 
   public abstract boolean isInRange(String instant);
 
+  @Override
+  public String toString() {
+return "InstantRange{"

Review Comment:
   The start or end range may be null.
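
   A null-tolerant sketch along those lines; `InstantRangeSketch` is a simplified stand-in for the real `InstantRange`, not the actual Hudi class:

```java
import java.io.Serializable;

public abstract class InstantRangeSketch implements Serializable {
  protected final String startInstant; // may be null for an open-ended range
  protected final String endInstant;   // may be null for an open-ended range

  protected InstantRangeSketch(String startInstant, String endInstant) {
    this.startInstant = startInstant;
    this.endInstant = endInstant;
  }

  @Override
  public String toString() {
    // Java string concatenation renders null as "null" rather than throwing,
    // so open-ended ranges print safely.
    return "InstantRange{startInstant='" + startInstant
        + "', endInstant='" + endInstant + "'}";
  }
}
```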






Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9892:
URL: https://github.com/apache/hudi/pull/9892#discussion_r1371133664


##
hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java:
##
@@ -86,30 +86,26 @@ public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) {
     GenericRecord incomingRecord = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
     eventTime = updateEventTime(incomingRecord, properties);
 
-    return isDeleteRecord(incomingRecord, properties) ? Option.empty() : Option.of(incomingRecord);
+    return isDeleted(schema, properties) ? Option.empty() : Option.of(incomingRecord);
   }
 
-  /**
-   * @param genericRecord instance of {@link GenericRecord} of interest.
-   * @param properties payload related properties
-   * @returns {@code true} if record represents a delete record. {@code false} otherwise.
-   */
-  protected boolean isDeleteRecord(GenericRecord genericRecord, Properties properties) {
-    final String deleteKey = properties.getProperty(DELETE_KEY);
+  @Override
+  protected boolean isDeleteRecord(GenericRecord record, Properties props) {
+    final String deleteKey = props.getProperty(DELETE_KEY);
     if (StringUtils.isNullOrEmpty(deleteKey)) {
-      return isDeleteRecord(genericRecord);
+      return super.isDeleteRecord(record, props);

Review Comment:
   Is this line the actual fix? I don't see the props being used by the super method, so do we still need to pass all the props around here?






Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9892:
URL: https://github.com/apache/hudi/pull/9892#discussion_r1371132703


##
hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java:
##
@@ -45,12 +45,12 @@ public class DefaultHoodieRecordPayload extends OverwriteWithLatestAvroPayload {
   public static final String DELETE_MARKER = "hoodie.payload.delete.marker";
   private Option<Object> eventTime = Option.empty();
 
-  public DefaultHoodieRecordPayload(GenericRecord record, Comparable orderingVal) {
-    super(record, orderingVal);
+  public DefaultHoodieRecordPayload(GenericRecord record, Comparable orderingVal, Properties props) {
+    super(record, orderingVal, props);
   }

Review Comment:
   The source of the props seems chaotic; I have already seen several ways it gets produced:
   
   1. `config.getPayloadConfig().getProps()` in `HoodieMergeHandle`;
   2. `payloadProps.setProperty(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, preCombineField);` in `HoodieFileSliceReader`;
   3. `config.getProps()` in `HoodieIndexUtils`.






Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9892:
URL: https://github.com/apache/hudi/pull/9892#discussion_r1371109726


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/PayloadCreation.java:
##
@@ -43,14 +44,17 @@ public class PayloadCreation implements Serializable {
   private static final long serialVersionUID = 1L;
 
   private final boolean shouldCombine;
+  private final boolean shouldUsePropsForPayload;
   private final Constructor<?> constructor;
   private final String preCombineField;
 
   private PayloadCreation(
       boolean shouldCombine,
+      boolean shouldUsePropsForPayload,
       Constructor<?> constructor,
       @Nullable String preCombineField) {
     this.shouldCombine = shouldCombine;
+    this.shouldUsePropsForPayload = shouldUsePropsForPayload;

Review Comment:
   Should `shouldUsePropsForPayload` always be true?



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/PayloadCreation.java:
##
@@ -60,34 +64,63 @@ public static PayloadCreation instance(Configuration conf) throws Exception {
     boolean needCombine = conf.getBoolean(FlinkOptions.PRE_COMBINE)
         || WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)) == WriteOperationType.UPSERT;
     boolean shouldCombine = needCombine && preCombineField != null;
+    boolean shouldUsePropsForPayload = true;
 
-    final Class<?>[] argTypes;
-    final Constructor<?> constructor;
+    Class<?>[] argTypes;
+    Constructor<?> constructor;
     if (shouldCombine) {
-      argTypes = new Class<?>[] {GenericRecord.class, Comparable.class};
+      argTypes = new Class<?>[] {GenericRecord.class, Comparable.class, Properties.class};
     } else {
-      argTypes = new Class<?>[] {Option.class};
+      argTypes = new Class<?>[] {Option.class, Properties.class};
     }
     final String clazz = conf.getString(FlinkOptions.PAYLOAD_CLASS_NAME);
-    constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
-    return new PayloadCreation(shouldCombine, constructor, preCombineField);
+    try {
+      constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
+    } catch (NoSuchMethodException e) {
+      shouldUsePropsForPayload = false;
+      if (shouldCombine) {
+        argTypes = new Class<?>[] {GenericRecord.class, Comparable.class};
+      } else {
+        argTypes = new Class<?>[] {Option.class};
+      }
+      constructor = ReflectionUtils.getClass(clazz).getConstructor(argTypes);
+    }
+    return new PayloadCreation(shouldCombine, shouldUsePropsForPayload, constructor, preCombineField);
+  }
+
+  public static Properties extractPropsFromConfiguration(Configuration config) {
+    Properties props = new Properties();

Review Comment:
   If all we want is payload properties, you can use `StreamerUtil.getPayloadConfig`.






Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9912:
URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778466955

   
   ## CI report:
   
   * 7f6535290896455bb3312e7203f2eafa69109f05 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20457)
 
   * aadc5fbc31b83cfff275fee66618071b0bc9e76d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9912:
URL: https://github.com/apache/hudi/pull/9912#issuecomment-1778462077

   
   ## CI report:
   
   * 7f6535290896455bb3312e7203f2eafa69109f05 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20457)
 
   * aadc5fbc31b83cfff275fee66618071b0bc9e76d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778461940

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371105049


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   }
 
   test("Test MergeInto for MOR table ") {
-    withRecordType()(withTempDir {tmp =>
+    spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+    withRecordType()(withTempDir { tmp =>

Review Comment:
   Why this change?






Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-24 Thread via GitHub


zhuanshenbsj1 commented on code in PR #9912:
URL: https://github.com/apache/hudi/pull/9912#discussion_r1371102567


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -34,10 +34,12 @@ public abstract class InstantRange implements Serializable {
 
   protected final String startInstant;
   protected final String endInstant;
+  protected final String rangeType;
 
-  public InstantRange(String startInstant, String endInstant) {
+  public InstantRange(String startInstant, String endInstant, String rangeType) {
     this.startInstant = startInstant;
     this.endInstant = endInstant;
+    this.rangeType = rangeType;

Review Comment:
   Adjusted as you suggested.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -57,6 +59,15 @@ public String getEndInstant() {
 
   public abstract boolean isInRange(String instant);
 
+  @Override
+  public String toString() {
+return "InstantRange{"

Review Comment:
   Done.






Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778455957

   
   ## CI report:
   
   * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


wecharyu commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1371097371


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
     val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf,
       ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
     if (enableFileIndex && globPaths.isEmpty) {
-      // NOTE: There are currently 2 ways partition values could be fetched:
-      //          - Source columns (producing the values used for physical partitioning) will be read
-      //            from the data file
-      //          - Values parsed from the actual partition path would be appended to the final dataset
-      //
-      //       In the former case, we don't need to provide the partition-schema to the relation,
-      //       therefore we simply stub it w/ empty schema and use full table-schema as the one being
-      //       read from the data file.

Review Comment:
   Got your point. The change here is because baseRelation will be converted to HadoopFsRelation only when `baseRelation.hasSchemaOnRead` is **false**:
   
https://github.com/apache/hudi/blob/65dd645b487a61fbca7e4e4b849d1f2f1ec143f9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala#L328-L332
   
   In that case `shouldExtractPartitionValuesFromPartitionPath` is true, so this is just a code simplification.






[jira] [Closed] (HUDI-6900) TestInsertTable "Test Bulk Insert Into Consistent Hashing Bucket Index Table" is failing continuously

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6900.

Fix Version/s: 1.0.0
   Resolution: Fixed

Fixed via master branch: 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9

> TestInsertTable "Test Bulk Insert Into Consistent Hashing Bucket Index Table" 
> is failing continuously
> -
>
> Key: HUDI-6900
> URL: https://issues.apache.org/jira/browse/HUDI-6900
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> The test is failing on travis CI but can not reproduce in local, need some 
> time to debug the reasons.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [MINOR] Add tests on combine parallelism (#9731)

2023-10-24 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 051eb0e930e [MINOR] Add tests on combine parallelism (#9731)
051eb0e930e is described below

commit 051eb0e930e983dd4118abec01e10d9b01f91ca0
Author: Y Ethan Guo 
AuthorDate: Tue Oct 24 20:19:08 2023 -0700

[MINOR] Add tests on combine parallelism (#9731)
---
 .../hudi/table/action/commit/BaseWriteHelper.java  | 11 +--
 .../table/action/commit/TestWriterHelperBase.java  | 90 ++
 .../table/action/commit/TestSparkWriteHelper.java  | 76 ++
 .../common/testutils/HoodieCommonTestHarness.java  | 11 ++-
 4 files changed, 180 insertions(+), 8 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java
index 8d8978927f6..b5edc7878f9 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java
@@ -27,7 +27,6 @@ import org.apache.hudi.common.util.HoodieRecordUtils;
 import org.apache.hudi.exception.HoodieUpsertException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.table.HoodieTable;
-
 import org.apache.hudi.table.action.HoodieWriteMetadata;
 
 import java.time.Duration;
@@ -48,12 +47,9 @@ public abstract class BaseWriteHelper<T, I, K, O, R> extends ParallelismHelper<I
       executor,
       WriteOperationType operationType) {
     try {
-      int targetParallelism =
-          deduceShuffleParallelism(inputRecords, configuredShuffleParallelism);
-
       // De-dupe/merge if needed
       I dedupedRecords =
-          combineOnCondition(shouldCombine, inputRecords, targetParallelism, table);
+          combineOnCondition(shouldCombine, inputRecords, configuredShuffleParallelism, table);
 
       Instant lookupBegin = Instant.now();
       I taggedRecords = dedupedRecords;
@@ -79,8 +75,9 @@ public abstract class BaseWriteHelper<T, I, K, O, R> extends ParallelismHelper<I
       table);
 
   public I combineOnCondition(
-      boolean condition, I records, int parallelism, HoodieTable table) {
-    return condition ? deduplicateRecords(records, table, parallelism) : records;
+      boolean condition, I records, int configuredParallelism, HoodieTable table) {
+    int targetParallelism = deduceShuffleParallelism(records, configuredParallelism);
+    return condition ? deduplicateRecords(records, table, targetParallelism) : records;
   }
 
   /**
diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java
new file mode 100644
index 000..2d43b414608
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/commit/TestWriterHelperBase.java
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.table.HoodieTable;
+
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.CsvSource;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+/**
+ * Tests for write helpers
+ */
+public abstract class TestWriterHelperBase extends HoodieCommonTestHarness {
+  private static int runNo = 0;
+  protected final BaseWriteHelper writeHelper;
+  protected HoodieEngineContext context;
+  protected HoodieTable table;
+  protected I inputRecord

Re: [PR] [MINOR] Add tests on combine parallelism [hudi]

2023-10-24 Thread via GitHub


nsivabalan merged PR #9731:
URL: https://github.com/apache/hudi/pull/9731





Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-24 Thread via GitHub


nsivabalan commented on code in PR #9892:
URL: https://github.com/apache/hudi/pull/9892#discussion_r1371085714


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroPayload.java:
##
@@ -39,11 +42,19 @@ public class HoodieAvroPayload implements HoodieRecordPayload<HoodieAvroPayload> {
   private final Comparable orderingVal;
 
   public HoodieAvroPayload(GenericRecord record, Comparable orderingVal) {
+    this(record, orderingVal, EMPTY_PROPS);

Review Comment:
   Shouldn't we mark these as deprecated?






Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]

2023-10-24 Thread via GitHub


voonhous commented on code in PR #9752:
URL: https://github.com/apache/hudi/pull/9752#discussion_r1371085479


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java:
##
@@ -115,6 +114,35 @@ public byte[] getContentBytes() throws IOException {
     return serializeRecords(records.get());
   }
 
+  private Schema getReaderSchema(Option<Schema> readerSchemaOpt) {
+    Schema writerSchema = getWriterSchema(super.getLogBlockHeader());
+    // If no reader-schema has been provided, assume the writer-schema as one
+    if (!readerSchemaOpt.isPresent()) {
+      return writerSchema;
+    }
+
+    // Handle table renames when there are still log files
+    Schema readerSchema = readerSchemaOpt.get();
+    if (isHandleDifferingNamespaceRequired(readerSchema, writerSchema)) {
+      return writerSchema;
+    } else {
+      return readerSchema;
+    }
+  }
+
+  /**
+   * Spark 3.1 uses avro:1.8.2, which matches fields by their fully qualified name. If namespaces differ, reads will fail for fields that have the same name and type but differing name(spaces).
+   * Such cases can arise when an ALTER-TABLE-RENAME DDL is performed.
+   *
+   * @param readerSchema the reader schema
+   * @param writerSchema the writer schema
+   * @return boolean if handling of differing namespaces between reader and writer schema is required
+   */
+  private static boolean isHandleDifferingNamespaceRequired(Schema readerSchema, Schema writerSchema) {
+    return readerSchema.getClass().getPackage().getImplementationVersion().compareTo("1.8.2") <= 0
+        && !readerSchema.getName().equals(writerSchema.getName());
+  }

Review Comment:
   Agreed, the fix here was written with backwards compatibility in mind. For tables that already have log files written under a certain format + namespace and have yet to be compacted, it is not realistic to modify those namespaces block by block.
   
   Given that this is an Avro-internal issue for lower Avro versions and will only happen when performing a merge, I thought it would be appropriate to put it here while reading log files.
   
   There are a few ways to fix this, like you mentioned:
   
   1. `ALTER-TABLE-RENAME-DDL` renames the Hive table name but does not perform any internal renames, i.e. `hoodie.properties` is not re-written, so the schema stays consistent.
   2. Block `ALTER-TABLE-RENAME-DDL` if there are uncompacted log files; that would make the DDL IO intensive, since we would need to scan through all latest file slices to see if any file groups remain uncompacted (akin to running a compaction-plan execution, and if the set of compaction operations is not empty, we reject the rename).
   3. Fix it while reading, as seen in the fix here (the code is ugly and we are coupling business logic with a dependency version).
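
   As a side note on robustness, here is a null-tolerant sketch of the version check from the diff above (not the PR's actual code): `Package#getImplementationVersion()` may return null, e.g. when the Avro classes are not loaded from a jar with a manifest, so it is worth guarding before comparing:

```java
import org.apache.avro.Schema;

public class AvroNamespaceCheckSketch {
  static boolean isHandleDifferingNamespaceRequired(Schema readerSchema, Schema writerSchema) {
    // Guard against a null implementation version before the lexicographic compare.
    String avroVersion = Schema.class.getPackage().getImplementationVersion();
    return avroVersion != null
        && avroVersion.compareTo("1.8.2") <= 0
        && !readerSchema.getName().equals(writerSchema.getName());
  }
}
```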






Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


ksmou commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1371082887


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator<String> getComparator() {
     return comparator;
   }
 
-  @Override
-  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
-      List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions
-    // limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment:
   The test failures are not related to this change.






[jira] [Created] (HUDI-6979) support EventTimeBasedCompactionStrategy

2023-10-24 Thread Kong Wei (Jira)
Kong Wei created HUDI-6979:
--

 Summary: support EventTimeBasedCompactionStrategy
 Key: HUDI-6979
 URL: https://issues.apache.org/jira/browse/HUDI-6979
 Project: Apache Hudi
  Issue Type: New Feature
  Components: compaction
Reporter: Kong Wei
Assignee: Kong Wei


The current compaction strategies are based on log file size, the number of log files, etc. The data time of the RO table generated by these strategies is uncontrollable. Hudi also has a day-based strategy, but it relies on a day-based partition path and its time granularity is coarse.

The *EventTimeBasedCompactionStrategy* can generate event-time-friendly RO tables, whether the table is day-partitioned or not. For example, the strategy can select for compaction all log files whose data time is before 3 am, so that the generated RO table contains exactly the data before 3 am. If we just want to query data before 3 am, we can query the RO table directly, which is much faster.

With this strategy, I think we can expand the application scenarios of RO tables.
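
A minimal, self-contained sketch of the proposed selection rule; `LogFileStats` and the event-time cut-off are hypothetical stand-ins for illustration, not actual Hudi APIs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class EventTimeSelectionSketch {
  // Hypothetical per-log-file statistics; a real strategy would derive the
  // max event time from log file metadata.
  record LogFileStats(String path, long maxEventTimeMillis) {}

  // Pick only log files whose data falls entirely before the cut-off
  // (e.g. 3 am), so the compacted RO view contains exactly that data.
  static List<LogFileStats> selectForCompaction(List<LogFileStats> candidates, long eventTimeCutoffMillis) {
    return candidates.stream()
        .filter(f -> f.maxEventTimeMillis() < eventTimeCutoffMillis)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<LogFileStats> files = Arrays.asList(
        new LogFileStats(".log.1", 1_000L),
        new LogFileStats(".log.2", 9_000L));
    System.out.println(selectForCompaction(files, 5_000L)); // only .log.1
  }
}
```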





Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9904:
URL: https://github.com/apache/hudi/pull/9904#discussion_r137106


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##
@@ -269,6 +269,9 @@ public Result inputSplits(
     Result hollowSplits = getHollowInputSplits(metaClient, metaClient.getHadoopConf(), issuedInstant, issuedOffset, commitTimeline, cdcEnabled);
 
     List<HoodieInstant> instants = filterInstantsWithRange(commitTimeline, issuedInstant);
+    int instantLimit = this.conf.getInteger(FlinkOptions.READ_COMMITS_LIMIT, Integer.MAX_VALUE);
+    instants = instants.subList(0, Math.min(instantLimit, instants.size()));
+

Review Comment:
   Can we add some tests for it.
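
   A minimal, framework-free sketch of such a test, checking the capping logic from the diff; `instantLimit` stands in for the value read from `FlinkOptions.READ_COMMITS_LIMIT`:

```java
import java.util.Arrays;
import java.util.List;

public class ReadCommitsLimitSketch {
  // Mirrors the subList capping from the diff above.
  static List<String> capInstants(List<String> instants, int instantLimit) {
    return instants.subList(0, Math.min(instantLimit, instants.size()));
  }

  public static void main(String[] args) {
    List<String> instants = Arrays.asList("001", "002", "003");
    if (!capInstants(instants, 2).equals(Arrays.asList("001", "002"))) {
      throw new AssertionError("limit below size should trim the tail");
    }
    if (!capInstants(instants, 10).equals(instants)) {
      throw new AssertionError("limit above size should be a no-op");
    }
    System.out.println("ok");
  }
}
```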






[jira] [Commented] (HUDI-6968) remove block logical in BulkInsertWriteFunction#open

2023-10-24 Thread Jing Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779295#comment-17779295
 ] 

Jing Zhang commented on HUDI-6968:
--

Fixed via master branch: f05b5fc9db38e0bc4ccc2941cccf049991b67db2

> remove block logical in BulkInsertWriteFunction#open
> 
>
> Key: HUDI-6968
> URL: https://issues.apache.org/jira/browse/HUDI-6968
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jing Zhang
>Priority: Trivial
>
> See more discussion in [PR9896|https://github.com/apache/hudi/pull/9896].





[jira] [Closed] (HUDI-6968) remove block logical in BulkInsertWriteFunction#open

2023-10-24 Thread Jing Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhang closed HUDI-6968.

Fix Version/s: 1.0.0
   Resolution: Fixed

> remove block logical in BulkInsertWriteFunction#open
> 
>
> Key: HUDI-6968
> URL: https://issues.apache.org/jira/browse/HUDI-6968
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jing Zhang
>Priority: Trivial
> Fix For: 1.0.0
>
>
> See more discussion in [PR9896|https://github.com/apache/hudi/pull/9896].





Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1371010223


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator<String> getComparator() {
     return comparator;
   }
 
-  @Override
-  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
-      List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
-    // Iterate through the operations and accept operations as long as we are within the configured target partitions
-    // limit
-    return operations.stream()
-        .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-        .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-        .flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
-    return allPartitionPaths.stream().map(partition -> partition.replace("/", "-"))
-        .sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/"))
+    return allPartitionPaths.stream().sorted(comparator)
         .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(),

Review Comment:
   Can you check the test failures?






Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1371009429


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: 
SQLContext,
 val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, 
sparkSession.sessionState.conf,
   ENABLE_HOODIE_FILE_INDEX.key, 
ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
 if (enableFileIndex && globPaths.isEmpty) {
-  // NOTE: There are currently 2 ways partition values could be fetched:
-  //  - Source columns (producing the values used for physical 
partitioning) will be read
-  //  from the data file
-  //  - Values parsed from the actual partition path would be 
appended to the final dataset
-  //
-  //In the former case, we don't need to provide the 
partition-schema to the relation,
-  //therefore we simply stub it w/ empty schema and use full 
table-schema as the one being
-  //read from the data file.

Review Comment:
   But after your change, the partition schema is always resolved from the 
partition path, which looks like a regression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9889:
URL: https://github.com/apache/hudi/pull/9889#discussion_r1368068891


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: 
SQLContext,
 val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, 
sparkSession.sessionState.conf,
   ENABLE_HOODIE_FILE_INDEX.key, 
ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
 if (enableFileIndex && globPaths.isEmpty) {
-  // NOTE: There are currently 2 ways partition values could be fetched:
-  //  - Source columns (producing the values used for physical 
partitioning) will be read
-  //  from the data file
-  //  - Values parsed from the actual partition path would be 
appended to the final dataset
-  //
-  //In the former case, we don't need to provide the 
partition-schema to the relation,
-  //therefore we simply stub it w/ empty schema and use full 
table-schema as the one being
-  //read from the data file.

Review Comment:
   Can you ensure that HUDI-4161 has been solved after this change? Can you 
elaborate on why `shouldExtractPartitionValuesFromPartitionPath` should be false 
after schema evolution?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778351985

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778352024

   
   ## CI report:
   
   * c140ff462f58b649d45c782ce072b683cd908c1c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20441)
 
   * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


boneanxs commented on PR #9887:
URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778351293

   @danny0405 Yea, sure, will raise the pr soon


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on PR #9887:
URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778350430

   @stream2000 @boneanxs Merging it first because it looks like a bug fix. Can 
you finalize it in follow-up PRs with more tests, or possibly the correct 
fix with `#abort`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]flink-sql write hudi use TIMESTAMP, when hive query, it get time+8h question, use TIMESTAMP_LTZ, the hive schema is bigint but timestamp [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on issue #9864:
URL: https://github.com/apache/hudi/issues/9864#issuecomment-1778351080

   > but TIMESTAMP cannot be changed to long 
   
   What do you mean by changed to long?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


boneanxs commented on PR #9887:
URL: https://github.com/apache/hudi/pull/9887#issuecomment-1778346251

   > we can confirm that datasource v2 won't wait for all subtasks to be 
canceled before calling 
`org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper#abort`
   
   It should be 
`org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort` 
instead of 
`org.apache.hudi.table.action.commit.BulkInsertDataInternalWriterHelper#abort`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1778343885

   
   ## CI report:
   
   * c140ff462f58b649d45c782ce072b683cd908c1c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20441)
 
   * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778343787

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6975] Optimize the implementation of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


ksmou commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1370998924


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator getComparator() {
 return comparator;
   }
 
-  @Override
-  public List orderAndFilter(HoodieWriteConfig 
writeConfig,
-  List operations, List 
pendingCompactionPlans) {
-// Iterate through the operations and accept operations as long as we are 
within the configured target partitions
-// limit
-return operations.stream()
-
.collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-
.sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-.flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List filterPartitionPaths(HoodieWriteConfig writeConfig, 
List allPartitionPaths) {
-return allPartitionPaths.stream().map(partition -> partition.replace("/", 
"-"))
-.sorted(Comparator.reverseOrder()).map(partitionPath -> 
partitionPath.replace("-", "/"))
+return allPartitionPaths.stream().sorted(comparator)
 .collect(Collectors.toList()).subList(0, 
Math.min(allPartitionPaths.size(),

Review Comment:
   Yes, it mainly removes the redundant orderAndFilter operation.
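   
   For reference, a minimal sketch of the consolidated strategy after dropping 
`orderAndFilter` (a sketch only; `comparator` is the strategy's existing 
partition-path comparator, and `limit` stands in for the `subList` bound in the 
actual diff):
   
   ```java
   // Partition filtering alone now bounds the compaction scope, so the
   // per-operation orderAndFilter pass becomes redundant.
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig,
                                            List<String> allPartitionPaths) {
     return allPartitionPaths.stream()
         .sorted(comparator)
         .limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
         .collect(Collectors.toList());
   }
   ```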



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


stream2000 commented on code in PR #9887:
URL: https://github.com/apache/hudi/pull/9887#discussion_r1370998616


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java:
##
@@ -97,7 +97,6 @@ public void commit(List writeStatuses) {
 
   public void abort() {
 LOG.error("Commit " + instantTime + " aborted ");
-writeClient.rollback(instantTime);

Review Comment:
   Will add a test in the next PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6959) Do not rollback current instant when bulk insert as row failed

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6959.

Resolution: Fixed

Fixed via master branch: 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9

> Do not rollback current instant when bulk insert as row failed
> --
>
> Key: HUDI-6959
> URL: https://issues.apache.org/jira/browse/HUDI-6959
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
>
> When org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort 
> is called, not all of the subtasks may have been canceled yet. So if we 
> roll back the current instant immediately, new files may be written 
> after the rollback is scheduled, which will cause dirty data.
>  
> We should rollback the failed instant using common mechanism eager and lazy 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6959) Do not rollback current instant when bulk insert as row failed

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6959:
-
Fix Version/s: 1.0.0
   0.14.1

> Do not rollback current instant when bulk insert as row failed
> --
>
> Key: HUDI-6959
> URL: https://issues.apache.org/jira/browse/HUDI-6959
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
>
> When org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite#abort 
> is called, not all of the subtasks may have been canceled yet. So if we 
> roll back the current instant immediately, new files may be written 
> after the rollback is scheduled, which will cause dirty data.
>  
> We should rollback the failed instant using common mechanism eager and lazy 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6959] Bulk insert as row do not rollback failed instant on abort (#9887)

2023-10-24 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 65dd645b487 [HUDI-6959] Bulk insert as row do not rollback failed 
instant on abort (#9887)
65dd645b487 is described below

commit 65dd645b487a61fbca7e4e4b849d1f2f1ec143f9
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Tue Oct 24 20:36:28 2023 -0500

[HUDI-6959] Bulk insert as row do not rollback failed instant on abort 
(#9887)
---
 .../java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java  | 1 -
 .../src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java
 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java
index 4ad6c2066a3..58bb3e4d608 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java
@@ -97,7 +97,6 @@ public class DataSourceInternalWriterHelper {
 
   public void abort() {
 LOG.error("Commit " + instantTime + " aborted ");
-writeClient.rollback(instantTime);
 writeClient.close();
   }
 
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
index 14bc84948c1..8cc107a24fb 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
@@ -1714,8 +1714,7 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
 }
   }
 
-  // [HUDI-6900] TestInsertTable "Test Bulk Insert Into Consistent Hashing 
Bucket Index Table" is failing continuously
-  ignore("Test Bulk Insert Into Consistent Hashing Bucket Index Table") {
+  test("Test Bulk Insert Into Consistent Hashing Bucket Index Table") {
 withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert") {
   Seq("false", "true").foreach { bulkInsertAsRow =>
 withTempDir { tmp =>



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


danny0405 merged PR #9887:
URL: https://github.com/apache/hudi/pull/9887


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778336139

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add tests on combine parallelism [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9731:
URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778335971

   
   ## CI report:
   
   * 047941b66ee52a99f626fd0dadb72581d9855385 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6975] Optimize the implementation of DayBasedCompactionStrategy [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1370996694


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator getComparator() {
 return comparator;
   }
 
-  @Override
-  public List orderAndFilter(HoodieWriteConfig 
writeConfig,
-  List operations, List 
pendingCompactionPlans) {
-// Iterate through the operations and accept operations as long as we are 
within the configured target partitions
-// limit
-return operations.stream()
-
.collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-
.sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-.flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List filterPartitionPaths(HoodieWriteConfig writeConfig, 
List allPartitionPaths) {
-return allPartitionPaths.stream().map(partition -> partition.replace("/", 
"-"))
-.sorted(Comparator.reverseOrder()).map(partitionPath -> 
partitionPath.replace("-", "/"))
+return allPartitionPaths.stream().sorted(comparator)
 .collect(Collectors.toList()).subList(0, 
Math.min(allPartitionPaths.size(),

Review Comment:
   Seems a pure code optimization?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6929) Lazy loading dynamically for CompletionTimeQueryView

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6929.

Resolution: Fixed

Fixed via master branch: bb8fc3e9f632a1fc3647fda63d482849355df2b7

> Lazy loading dynamically for CompletionTimeQueryView
> 
>
> Key: HUDI-6929
> URL: https://issues.apache.org/jira/browse/HUDI-6929
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6962) Correct the behavior of bulk insert for NB-CC

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6962:
-
Fix Version/s: 1.0.0

> Correct the behavior of bulk insert for NB-CC 
> --
>
> Key: HUDI-6962
> URL: https://issues.apache.org/jira/browse/HUDI-6962
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> How to handle the case where multiple writers include a job with a bulk insert 
> operation?
> 1. Generated file group id: Generate a fixed file group ID, so that all jobs 
> use the fixed file group id suffix instead of a random uuid suffix. The 
> behavior needs to be consistent to prevent later writer jobs from writing 
> records with the same primary key to different file groups.
> 2. Deal with the transaction: The conflict resolution of bulk insert cannot be 
> deferred to the compaction phase. Because bulk insert writers flush data into 
> base files, if there are multiple bulk insert jobs, there might exist 
> multiple base files in the same bucket.
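
A rough illustration of point 1 above (a sketch with illustrative names only, 
not the actual Hudi API):

{code:java}
import java.util.UUID;

// Why the suffix must be deterministic: with a random suffix, two bulk-insert
// jobs targeting the same bucket mint different file group ids, so records
// with the same key can land in different file groups; a fixed suffix makes
// every writer derive the same id for a bucket.
public class FileGroupIdSketch {
  static String randomSuffixId(int bucketId) {   // old behavior
    return String.format("%08d-%s", bucketId, UUID.randomUUID());
  }

  static String fixedSuffixId(int bucketId) {    // fixed suffix for NB-CC bulk insert
    return String.format("%08d-0", bucketId);
  }
}
{code}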



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6962) Correct the behavior of bulk insert for NB-CC

2023-10-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6962.

Resolution: Fixed

Fixed via master branch: f05b5fc9db38e0bc4ccc2941cccf049991b67db2

> Correct the behavior of bulk insert for NB-CC 
> --
>
> Key: HUDI-6962
> URL: https://issues.apache.org/jira/browse/HUDI-6962
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> How to handle the case where multiple writers include a job with a bulk insert 
> operation?
> 1. Generated file group id: Generate a fixed file group ID, so that all jobs 
> use the fixed file group id suffix instead of a random uuid suffix. The 
> behavior needs to be consistent to prevent later writer jobs from writing 
> records with the same primary key to different file groups.
> 2. Deal with the transaction: The conflict resolution of bulk insert cannot be 
> deferred to the compaction phase. Because bulk insert writers flush data into 
> base files, if there are multiple bulk insert jobs, there might exist 
> multiple base files in the same bucket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC (#9896)

2023-10-24 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f05b5fc9db3 [HUDI-6962] Fix the conflicts resolution for bulk insert 
under NB-CC (#9896)
f05b5fc9db3 is described below

commit f05b5fc9db38e0bc4ccc2941cccf049991b67db2
Author: Jing Zhang 
AuthorDate: Wed Oct 25 09:29:13 2023 +0800

[HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC (#9896)

* Flink bulk_insert with fixed file group id suffix if NB-CC is enabled;
* The bulk_insert writer should resolve conflicts with other writers under 
OCC strategies.
---
 .../apache/hudi/client/utils/TransactionUtils.java |   5 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  11 +
 .../apache/hudi/client/HoodieFlinkWriteClient.java |   4 +-
 .../hudi/sink/StreamWriteOperatorCoordinator.java  |   2 +-
 .../sink/bucket/BucketBulkInsertWriterHelper.java  |  14 +-
 .../hudi/sink/bulk/BulkInsertWriteFunction.java|  15 +-
 .../java/org/apache/hudi/sink/utils/Pipelines.java |   3 +-
 .../hudi/sink/TestWriteMergeOnReadWithCompact.java | 116 +++
 .../hudi/sink/utils/BulkInsertFunctionWrapper.java | 232 +
 .../org/apache/hudi/sink/utils/TestWriteBase.java  |  25 +++
 .../test/java/org/apache/hudi/utils/TestData.java  |   5 +-
 .../org/apache/hudi/adapter/TestStreamConfigs.java |  32 +++
 .../org/apache/hudi/adapter/TestStreamConfigs.java |  32 +++
 .../org/apache/hudi/adapter/TestStreamConfigs.java |  32 +++
 .../org/apache/hudi/adapter/TestStreamConfigs.java |  35 
 .../org/apache/hudi/adapter/TestStreamConfigs.java |  35 
 16 files changed, 581 insertions(+), 17 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java
index 15f6be8f79a..1bea51721c8 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java
@@ -21,6 +21,7 @@ package org.apache.hudi.client.utils;
 import org.apache.hudi.client.transaction.ConcurrentOperation;
 import org.apache.hudi.client.transaction.ConflictResolutionStrategy;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
@@ -67,8 +68,8 @@ public class TransactionUtils {
   Option lastCompletedTxnOwnerInstant,
   boolean reloadActiveTimeline,
   Set pendingInstants) throws HoodieWriteConflictException {
-// Skip to resolve conflict if using non-blocking concurrency control
-if 
(config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() && 
!config.isNonBlockingConcurrencyControl()) {
+WriteOperationType operationType = 
thisCommitMetadata.map(HoodieCommitMetadata::getOperationType).orElse(null);
+if (config.needResolveWriteConflict(operationType)) {
   // deal with pendingInstants
   Stream completedInstantsDuringCurrentWriteOperation = 
getCompletedInstantsDuringCurrentWriteOperation(table.getMetaClient(), 
pendingInstants);
 
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index c9e9b94b1a9..8c08beaaef9 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -46,6 +46,7 @@ import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
 import org.apache.hudi.common.model.RecordPayloadType;
 import org.apache.hudi.common.model.WriteConcurrencyMode;
+import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.log.block.HoodieLogBlock;
 import org.apache.hudi.common.table.marker.MarkerType;
@@ -2616,6 +2617,16 @@ public class HoodieWriteConfig extends HoodieConfig {
 return props.getInteger(WRITES_FILEID_ENCODING, 
HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  // NB-CC don't need to resolve write conflict except bulk insert 
operation
+  return WriteOperationType.BULK_INSERT == operationType || 
!isNonBlockingConcurrencyControl();
+}

Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-24 Thread via GitHub


danny0405 merged PR #9896:
URL: https://github.com/apache/hudi/pull/9896


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1778327433

   The failed test is known to be flaky: 
`TestHoodieLogFormat.testAvroLogRecordReaderWithMixedInsertsCorruptsRollbackAndMergedLogBlock`
 : 
https://pipelinesghubeus23.actions.githubusercontent.com/2uhBcZr3qV5ap2vibMf4tU0bjg49uuN9wlovCTzCjH6fMLAme0/_apis/pipelines/1/runs/42270/signedlogcontent/13?urlExpires=2023-10-25T01%3A24%3A24.0091887Z&urlSigningMethod=HMACV1&urlSignature=5EgWFpWhEswB%2FySzG2hp2q99FnPaNFTCC3zvozWazEM%3D


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]flink 写hudi 同步hive后,timestamp字段为什么是bigint类型,如何才能让同步到hive的字段保持timestamp类型 [hudi]

2023-10-24 Thread via GitHub


linrongjun-l commented on issue #9766:
URL: https://github.com/apache/hudi/issues/9766#issuecomment-1778312506

   > > Before release 0.14.0, there is a sync param 
`hive_sync.support_timestamp`, when enabled, the `Timestamp(6)` type would be 
synced as `TIMESTAMP` in hive, since release 0.14.0, all the timestamp type 
would be synced as `TIMESTAMP`.
   > 
   > Thanks for your reply. When I enable hive_sync.support_timestamp, the 
field type in Hive is indeed TIMESTAMP. But when I select the value in Hive, 
there is an error: Error: java.io.IOException: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: 
org.apache.hadoop.io.LongWritable cannot be cast to 
org.apache.hadoop.hive.serde2.io.TimestampWritable
   
   I also hit the same problem; how did you solve it in the end?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9914:
URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778296527

   
   ## CI report:
   
   * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20462)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778296319

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add tests on combine parallelism [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9731:
URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778296103

   
   ## CI report:
   
   * 047941b66ee52a99f626fd0dadb72581d9855385 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add tests on combine parallelism [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9731:
URL: https://github.com/apache/hudi/pull/9731#issuecomment-1778294686

   CI is green.
   https://github.com/apache/hudi/assets/2497195/b14e4414-fbb5-4f1b-a3e0-5a2d8335775d
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9914:
URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778289302

   
   ## CI report:
   
   * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test

2023-10-24 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6978:

Description: 
For the test
TestMergeIntoTable2@"Test only insert for source table in dup key without 
preCombineField"

  was:
For the test
"Test only insert for source table in dup key without preCombineField"
@"Test only insert for source table in dup key without preCombineField"


> Fix TestMergeIntoTable2 test
> 
>
> Key: HUDI-6978
> URL: https://issues.apache.org/jira/browse/HUDI-6978
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> For the test
> TestMergeIntoTable2@"Test only insert for source table in dup key without 
> preCombineField"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test

2023-10-24 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6978:

Description: 
For the test
"Test only insert for source table in dup key without preCombineField"
@"Test only insert for source table in dup key without preCombineField"

  was:For @"Test only insert for source table in dup key without 
preCombineField"


> Fix TestMergeIntoTable2 test
> 
>
> Key: HUDI-6978
> URL: https://issues.apache.org/jira/browse/HUDI-6978
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> For the test
> "Test only insert for source table in dup key without preCombineField"
> @"Test only insert for source table in dup key without preCombineField"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test

2023-10-24 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6978:

Description: 
For the test TestMergeIntoTable2@"Test only insert for source table in dup key 
without preCombineField", after adding "
spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")", the test 
fails:
{code:java}
Expected Array([1,a2,10.4,1004,2021-03-21], [1,a2,10.4,1004,2021-03-21], 
[3,a3,10.3,1003,2021-03-21]), but got Array([1,a2,10.2,1002,2021-03-21], 
[1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21])
ScalaTestFailureLocation: org.apache.spark.sql.hudi.HoodieSparkSqlTestBase at 
(HoodieSparkSqlTestBase.scala:109)
org.scalatest.exceptions.TestFailedException: Expected 
Array([1,a2,10.4,1004,2021-03-21], [1,a2,10.4,1004,2021-03-21], 
[3,a3,10.3,1003,2021-03-21]), but got Array([1,a2,10.2,1002,2021-03-21], 
[1,a2,10.4,1004,2021-03-21], [3,a3,10.3,1003,2021-03-21])
    at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
    at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
    at 
org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
    at org.scalatest.Assertions.assertResult(Assertions.scala:867)
    at org.scalatest.Assertions.assertResult$(Assertions.scala:863)
    at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1562)
    at 
org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.checkAnswer(HoodieSparkSqlTestBase.scala:109)
    at 
org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$36(TestMergeIntoTable2.scala:897)
    at 
org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$36$adapted(TestMergeIntoTable2.scala:841)
    at 
org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.withTempDir(HoodieSparkSqlTestBase.scala:77)
    at 
org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$35(TestMergeIntoTable2.scala:841)
    at 
org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$35$adapted(TestMergeIntoTable2.scala:840)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at 
org.apache.spark.sql.hudi.TestMergeIntoTable2.$anonfun$new$34(TestMergeIntoTable2.scala:840)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.$anonfun$test$1(HoodieSparkSqlTestBase.scala:85)
    at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
    at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
    at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
    at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1562)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
    at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1562)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
    at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
    at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
    at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
    at org.scalatest.Suite.run(Suite.scala:1112)
    at org.scalatest.Suite.run$(Suite.scala:1094)
    at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
    at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
    at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
    at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
    at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
    at 
org.apache.spark.sql.hudi.HoodieSparkSqlTestBase.org$scalatest$BeforeAndAfterAll$$super$run(HoodieSparkSqlTestBase.scala:44)
    at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
    at org.scalatest.BeforeAndAfterAll.run(Bef

[jira] [Updated] (HUDI-6978) Fix TestMergeIntoTable2 test

2023-10-24 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6978:

Description: For @"Test only insert for source table in dup key without 
preCombineField"

> Fix TestMergeIntoTable2 test
> 
>
> Key: HUDI-6978
> URL: https://issues.apache.org/jira/browse/HUDI-6978
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> For @"Test only insert for source table in dup key without preCombineField"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6978) Fix TestMergeIntoTable2 test

2023-10-24 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6978:
---

 Summary: Fix TestMergeIntoTable2 test
 Key: HUDI-6978
 URL: https://issues.apache.org/jira/browse/HUDI-6978
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-24 Thread via GitHub


fenil25 opened a new issue, #9915:
URL: https://github.com/apache/hudi/issues/9915

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   I want to bootstrap a table into Hudi. Size of the table is around 12 TB. 
The base path of the source table is in S3. Its a partitioned hive table and 
the average parquet file size is 2.5Gb.  I used the FULL_RECORD bootstrap mode 
using Spark for bootstrapping and it was successful. 
   However, the average file size of hudi table was around 120 Mb which aligns 
with the default which ended up creating 100K+ files. I am using S3 storage as 
the DFS. This made the read performance quite slow. 
   I am not using any table partitioning yet. I did set 
`hoodie.parquet.max.file.size": 1258291200,` (~1.2Gb) but this configuration 
was completely ignored. 
   FAQs and File Sizing docs mainly talk about ways to adjust the file size 
while streaming data into Hudi. 
   How can I control the file size during the bootstrapping process itself? 
   
   I also read in the docs that - 
   ```
   A full record bootstrap is functionally equivalent to a bulk-insert.
   ```
   Does that mean both are essentially the same? Is there any advantage of 
using one over the other? (Note: _METADATA_ONLY does not work for our 
use case._)
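   
   For reference, the write invocation looks roughly like this (a sketch using 
the Java Spark API; the table name, key field, and paths are placeholders, and 
the option values are illustrative):
   
   ```java
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class BootstrapSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().getOrCreate();
       // FULL_RECORD bootstrap of an existing parquet table into Hudi.
       spark.emptyDataFrame().write().format("hudi")
           .option("hoodie.table.name", "my_table")                     // placeholder
           .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder
           .option("hoodie.datasource.write.operation", "bootstrap")
           .option("hoodie.bootstrap.base.path", "s3://bucket/source")  // placeholder
           .option("hoodie.bootstrap.mode.selector",
               "org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector")
           .option("hoodie.parquet.max.file.size", "1258291200")        // ~1.2 GB, ignored in my run
           .mode(SaveMode.Overwrite)
           .save("s3://bucket/hudi/my_table");                          // placeholder
     }
   }
   ```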
   
   
   **Environment Description**
   Running it via EMR 
   
   * Hudi version : 0.13.0 
   
   * Spark version : 3.3 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Test ci [hudi]

2023-10-24 Thread via GitHub


kkalanda-score closed pull request #9095: Test ci
URL: https://github.com/apache/hudi/pull/9095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6551] A new slashed month partition value extractor [hudi]

2023-10-24 Thread via GitHub


yihua closed pull request #9184: [HUDI-6551] A new slashed month partition 
value extractor
URL: https://github.com/apache/hudi/pull/9184


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1778089960

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6977] Upgrade hadoop version from 2.10.1 to 2.10.2 [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9914:
URL: https://github.com/apache/hudi/pull/9914#issuecomment-1778090503

   
   ## CI report:
   
   * 6aa578288e31414d8f13c37525ed4e2b7d9a6521 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20462)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6551] A new slashed month partition value extractor [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9184:
URL: https://github.com/apache/hudi/pull/9184#issuecomment-1778090063

   Closing this PR now.  @banank1989 feel free to reopen it if you need 
additional functionality.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Test ci [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1778088836

   @kkalanda-score do you still need this PR?  If not, the PR should be closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6898] Medatawriter closing in tests, update logging [hudi]

2023-10-24 Thread via GitHub


yihua merged PR #9768:
URL: https://github.com/apache/hudi/pull/9768


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778076328

   I discussed the comments with @danny0405 offline.  Two things to address in 
this PR:
   
   (1) Instead of putting both the partial and full schemas in the log block 
header, when partial updates are enabled, only the partial schema is added to 
the log block header under the same `SCHEMA` key, and the full schema for 
snapshot reads is always passed in from the table schema.  To indicate that 
the schema is partial, a new log block header `IS_PARTIAL` should be added.
   (2) We should let users specify in the MERGE INTO statement whether they 
want partial updates in the log files of MOR tables, e.g., using something 
like `col = EXISTING` to indicate that the column values should be kept as is.  
We may not support this in this PR; instead, we should have an interim write 
config to control this behavior.
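   
   A minimal sketch of what (1) amounts to when building the block header (the 
names follow the current log block API; `IS_PARTIAL` is the proposed key and 
does not exist yet, so it is commented out):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.avro.Schema;
   import org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType;
   
   class PartialUpdateHeaderSketch {
     // The block carries just the partial schema plus a flag; snapshot readers
     // always take the full schema from the table schema.
     static Map<HeaderMetadataType, String> buildHeader(String instantTime, Schema partialSchema) {
       Map<HeaderMetadataType, String> header = new HashMap<>();
       header.put(HeaderMetadataType.INSTANT_TIME, instantTime);
       header.put(HeaderMetadataType.SCHEMA, partialSchema.toString()); // partial schema only
       // header.put(HeaderMetadataType.IS_PARTIAL, "true");            // proposed new header key
       return header;
     }
   }
   ```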


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1370828538


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   Agree that option 1 is the most natural handling.
   
   In the current schema evolution on write, the write schema is evolved based 
on the input, and the evolved schema is written to the commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6836] Shutting down deltastreamer in tests and shutting down metrics for write client [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9667:
URL: https://github.com/apache/hudi/pull/9667#issuecomment-1778007738

   @pratyakshsharma are you good with the changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9752:
URL: https://github.com/apache/hudi/pull/9752#discussion_r1370788996


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java:
##
@@ -115,6 +114,35 @@ public byte[] getContentBytes() throws IOException {
 return serializeRecords(records.get());
   }
 
+  private Schema getReaderSchema(Option readerSchemaOpt) {
+Schema writerSchema = getWriterSchema(super.getLogBlockHeader());
+// If no reader-schema has been provided assume writer-schema as one
+if (!readerSchemaOpt.isPresent()) {
+  return writerSchema;
+}
+
+// Handle table renames when there are still log files
+Schema readerSchema = readerSchemaOpt.get();
+if (isHandleDifferingNamespaceRequired(readerSchema, writerSchema)) {
+  return writerSchema;
+} else {
+  return readerSchema;
+}
+  }
+
+  /**
+   * Spark 3.1 uses avro:1.8.2, which matches fields by their fully qualified name. If namespaces differ, reads will fail for fields that have the same name and type but differing namespaces.
+   * Such cases can arise when an ALTER-TABLE-RENAME DDL is performed.
+   *
+   * @param readerSchema the reader schema
+   * @param writerSchema the writer schema
+   * @return true if handling of differing namespaces between reader and writer schema is required
+   */
+  private static boolean isHandleDifferingNamespaceRequired(Schema readerSchema, Schema writerSchema) {
+    return readerSchema.getClass().getPackage().getImplementationVersion().compareTo("1.8.2") <= 0
+        && !readerSchema.getName().equals(writerSchema.getName());
+  }

Review Comment:
   The fix works.  But I think such details should not be exposed at this layer of reading log files.  It's better to fix the schema generation in the ALTER-TABLE-RENAME DDL to produce consistent namespaces, or to resolve the schema's namespace in upper layers, e.g., `TableSchemaResolver`.  And the schema namespace should not change across Hudi commits.
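   
   For illustration, one possible shape of that upper-layer fix, as a sketch only: a hypothetical helper (not an existing Hudi API) that rebuilds a record schema under the table's canonical namespace, so that Avro's fully-qualified-name matching no longer trips over a rename.  A complete fix would also need to recurse into nested record types.
   
   ```java
   import org.apache.avro.Schema;
   
   import java.util.List;
   import java.util.stream.Collectors;
   
   public class SchemaNamespaceUtil {
     // Hypothetical helper: rebuild a top-level record schema under the given
     // namespace, keeping field names, types, docs, and defaults intact.
     // (Uses the Object-default Schema.Field constructor from recent Avro versions.)
     public static Schema withNamespace(Schema schema, String namespace) {
       List<Schema.Field> fields = schema.getFields().stream()
           .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
           .collect(Collectors.toList());
       return Schema.createRecord(
           schema.getName(), schema.getDoc(), namespace, schema.isError(), fields);
     }
   }
   ```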



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6898] Medatawriter closing in tests, update logging [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9768:
URL: https://github.com/apache/hudi/pull/9768#issuecomment-1778003218

   
   ## CI report:
   
   * 55beb62d168b2c9b9d99f0c3765637d441f58b5f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20458)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6877] Fix avro read issue after ALTER TABLE RENAME DDL on Spark3_1 [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9752:
URL: https://github.com/apache/hudi/pull/9752#issuecomment-1777997252

   > Seems we have a plan to migrate to Avro above 1.8.2, right? cc @yihua ~
   
   The Avro dependency version is tied to the Spark version, and Avro 1.8.2 is tied to Spark 3.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6895][WIP] Change default timeline timezone from local to UTC [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9794:
URL: https://github.com/apache/hudi/pull/9794#issuecomment-1777989084

   @codope do we still plan to land this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9887:
URL: https://github.com/apache/hudi/pull/9887#discussion_r1370778292


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/DataSourceInternalWriterHelper.java:
##
@@ -97,7 +97,6 @@ public void commit(List writeStatuses) {
 
   public void abort() {
 LOG.error("Commit " + instantTime + " aborted ");
-writeClient.rollback(instantTime);

Review Comment:
   The fix makes sense based on the information provided.  @stream2000 could you add a test to verify that after a bulk insert with DS v2 fails, the commit is left inflight, and a subsequent new writer / transaction rolls back the commit?
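   
   A rough sketch of what such a test could assert, assuming a harness that forces the DSv2 bulk insert to fail mid-write (the failure injection and the surrounding variables such as `hadoopConf`, `basePath`, `instantTime`, `context`, and `writeConfig` are assumed to come from the test setup):
   
   ```java
   import org.apache.hudi.client.SparkRDDWriteClient;
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
   
   import static org.junit.jupiter.api.Assertions.assertFalse;
   import static org.junit.jupiter.api.Assertions.assertTrue;
   
   // ... inside a test method, after the injected bulk-insert failure:
   HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
       .setConf(hadoopConf).setBasePath(basePath).build();
   
   // 1. The aborted commit should be left inflight rather than rolled back by abort().
   assertTrue(metaClient.getActiveTimeline().filterInflights().containsInstant(instantTime));
   
   // 2. A subsequent writer (with the default EAGER failed-writes cleaning policy)
   //    should roll back the inflight commit when starting its next commit.
   try (SparkRDDWriteClient client = new SparkRDDWriteClient(context, writeConfig)) {
     client.startCommitWithTime(HoodieActiveTimeline.createNewInstantTime());
   }
   metaClient.reloadActiveTimeline();
   assertFalse(metaClient.getActiveTimeline().filterInflights().containsInstant(instantTime));
   ```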



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1370761534


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodiePartitionCDCFileGroupMapping.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.model.HoodieFileGroupId
+import org.apache.hudi.common.table.cdc.HoodieCDCFileSplit
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types.{DataType, Decimal}
+import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
+
+import java.util
+
+case class HoodiePartitionCDCFileGroupMapping(partitionValues: InternalRow,
+                                              fileGroups: Map[HoodieFileGroupId, List[HoodieCDCFileSplit]]
+                                             ) extends InternalRow {
+
+  def getFileSplitsFor(fileGroupId: HoodieFileGroupId): Option[List[HoodieCDCFileSplit]] = {
+    fileGroups.get(fileGroupId)
+  }
+
+  override def numFields: Int = {
+    partitionValues.numFields
+  }
+
+  override def setNullAt(i: Int): Unit = {
+    partitionValues.setNullAt(i)
+  }
+
+  override def update(i: Int, value: Any): Unit = {
+    partitionValues.update(i, value)
+  }
+
+  override def copy(): InternalRow = {
+    HoodiePartitionCDCFileGroupMapping(partitionValues.copy(), fileGroups)
+  }
+
+  override def isNullAt(ordinal: Int): Boolean = {
+    partitionValues.isNullAt(ordinal)
+  }
+
+  override def getBoolean(ordinal: Int): Boolean = {
+    partitionValues.getBoolean(ordinal)
+  }
+
+  override def getByte(ordinal: Int): Byte = {
+    partitionValues.getByte(ordinal)
+  }
+
+  override def getShort(ordinal: Int): Short = {
+    partitionValues.getShort(ordinal)
+  }
+
+  override def getInt(ordinal: Int): Int = {
+    partitionValues.getInt(ordinal)
+  }
+
+  override def getLong(ordinal: Int): Long = {
+    partitionValues.getLong(ordinal)
+  }
+
+  override def getFloat(ordinal: Int): Float = {
+    partitionValues.getFloat(ordinal)
+  }
+
+  override def getDouble(ordinal: Int): Double = {
+    partitionValues.getDouble(ordinal)
+  }
+
+  override def getDecimal(ordinal: Int, precision: Int, scale: Int): Decimal = {
+    partitionValues.getDecimal(ordinal, precision, scale)
+  }
+
+  override def getUTF8String(ordinal: Int): UTF8String = {
+    partitionValues.getUTF8String(ordinal)
+  }
+
+  override def getBinary(ordinal: Int): Array[Byte] = {
+    partitionValues.getBinary(ordinal)
+  }
+
+  override def getInterval(ordinal: Int): CalendarInterval = {
+    partitionValues.getInterval(ordinal)
+  }
+
+  override def getStruct(ordinal: Int, numFields: Int): InternalRow = {
+    partitionValues.getStruct(ordinal, numFields)
+  }
+
+  override def getArray(ordinal: Int): ArrayData = {
+    partitionValues.getArray(ordinal)
+  }
+
+  override def getMap(ordinal: Int): MapData = {
+    partitionValues.getMap(ordinal)
+  }
+
+  override def get(ordinal: Int, dataType: DataType): AnyRef = {
+    partitionValues.getMap(ordinal)

Review Comment:
   this should be `partitionValues.get(ordinal, dataType)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9888:
URL: https://github.com/apache/hudi/pull/9888#discussion_r1370759016


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -141,12 +145,37 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   case _ => baseFileReader(file)
 }
   }
+// CDC queries.
+case hoodiePartitionCDCFileGroupSliceMapping: HoodiePartitionCDCFileGroupMapping =>
+  val filePath: Path = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)
+  val fileGroupId: HoodieFileGroupId = new HoodieFileGroupId(filePath.getParent.toString, filePath.getName)
+  val fileSplits = hoodiePartitionCDCFileGroupSliceMapping.getFileSplitsFor(fileGroupId).get.toArray
+  val fileGroupSplit: HoodieCDCFileGroupSplit = HoodieCDCFileGroupSplit(fileSplits)
+  buildCDCRecordIterator(fileGroupSplit, preMergeBaseFileReader, hadoopConf, requiredSchema, props)
 // TODO: Use FileGroupReader here: HUDI-6942.
 case _ => baseFileReader(file)
   }
 }
   }
 
+  protected def buildCDCRecordIterator(cdcFileGroupSplit: HoodieCDCFileGroupSplit,
+                                       preMergeBaseFileReader: PartitionedFile => Iterator[InternalRow],
+                                       hadoopConf: Configuration,
+                                       requiredSchema: StructType,
+                                       props: TypedProperties): Iterator[InternalRow] = {
+    val metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, tableState.tablePath, props)
+    val cdcSchema = CDCRelation.FULL_CDC_SPARK_SCHEMA
+    new CDCFileGroupIterator(

Review Comment:
   This does not seem to leverage `HoodieFileGroupReader`.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##
@@ -119,6 +120,35 @@ case class MergeOnReadIncrementalRelation(override val 
sqlContext: SQLContext,
 }
   }
 
+  def listFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Map[InternalRow, Seq[FileSlice]] = {

Review Comment:
   Could this be extracted into a util method instead of sitting inside the MOR incremental relation, given that the relation itself will not be used by the new Hudi parquet file format class?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCFileGroupIterator.scala:
##
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.cdc
+
+import org.apache.avro.Schema
+import org.apache.avro.generic.{GenericData, GenericRecord, IndexedRecord}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.BaseFileReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieDataSourceHelper.AvroDeserializerSupport
+import org.apache.hudi.avro.HoodieAvroUtils
+import org.apache.hudi.{AvroConversionUtils, AvroProjection, HoodieFileIndex, HoodieMergeOnReadFileSplit, HoodieTableSchema, HoodieTableState, LogFileIterator, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
+import org.apache.hudi.common.model.{FileSlice, HoodieAvroRecordMerger, HoodieLogFile, HoodieRecord, HoodieRecordMerger, HoodieRecordPayload}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.cdc.{HoodieCDCFileSplit, HoodieCDCUtils}
+import org.apache.hudi.common.table.cdc.HoodieCDCInferenceCase._
+import org.apache.hudi.common.table.log.HoodieCDCLogRecordIterator
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation._
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode._
+import org.apache.hudi.common.util.ValidationUtils.checkState
+import org.apache.hudi.config.HoodiePayloadConfig
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import 
org.apache.spark.sql.HoodieCatalystEx

Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1777922386

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 0fe4d74eb04601d878a44c6d8892168e1e321d1a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20405)
 
   * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


