[I] [SUPPORT] Duplicate data in base file of MOR table [hudi]

2024-03-18 Thread via GitHub


wqwl611 opened a new issue, #10882:
URL: https://github.com/apache/hudi/issues/10882

   I want to upgrade Hudi from 0.11.1 to 0.13.1, but I encountered duplicate data in the base file. I have never encountered this before with the same configuration.
   
   
   Screenshot: https://github.com/apache/hudi/assets/67826098/9f10b00b-c47f-47dd-959c-1941998614d5
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on PR #10856:
URL: https://github.com/apache/hudi/pull/10856#issuecomment-2005976263

   Changed all usages of `cleaner` to `clean`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1529818335


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##
@@ -118,28 +120,32 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "the minimum number of file slices to retain in each file group, 
during cleaning.");
 
   public static final ConfigProperty CLEAN_TRIGGER_STRATEGY = 
ConfigProperty
-  .key("hoodie.clean.trigger.strategy")
+  .key("hoodie.cleaner.trigger.strategy")
   .defaultValue(CleaningTriggerStrategy.NUM_COMMITS.name())
+  .withAlternatives("hoodie.clean.trigger.strategy")
   .markAdvanced()
   .withDocumentation(CleaningTriggerStrategy.class);
 
   public static final ConfigProperty CLEAN_MAX_COMMITS = ConfigProperty
-  .key("hoodie.clean.max.commits")
+  .key("hoodie.cleaner.trigger.max.commits")
   .defaultValue("1")
+  .withAlternatives("hoodie.clean.max.commits")
   .markAdvanced()
   .withDocumentation("Number of commits after the last clean operation, 
before scheduling of a new clean is attempted.");
 
   public static final ConfigProperty CLEANER_INCREMENTAL_MODE_ENABLE = 
ConfigProperty
-  .key("hoodie.cleaner.incremental.mode")
+  .key("hoodie.cleaner.incremental.enabled")
   .defaultValue("true")
+  .withAlternatives("hoodie.cleaner.incremental.mode")
   .markAdvanced()
   .withDocumentation("When enabled, the plans for each cleaner service run 
is computed incrementally off the events "
   + "in the timeline, since the last cleaner run. This is much more 
efficient than obtaining listings for the full "
   + "table for each planning (even with a metadata table).");
 
   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = 
ConfigProperty
-  .key("hoodie.cleaner.policy.failed.writes")
+  .key("hoodie.cleaner.failed.writes.policy")

Review Comment:
   It makes sense. Changed all `cleaner` to `clean` and updated the overview table in this MR description.
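   
   For readers following along, the pattern under discussion looks roughly like the
   sketch below: the preferred key gets the new name, and the previous spelling is kept
   via `withAlternatives` so existing configurations continue to resolve. The key names
   here are illustrative, not necessarily the final ones adopted in this PR.
   
   ```java
   // Illustrative sketch only -- key names are examples, not necessarily the final
   // names in this PR. The point is that withAlternatives() lets the old spelling
   // keep working after the rename.
   public static final ConfigProperty<String> CLEAN_MAX_COMMITS = ConfigProperty
       .key("hoodie.clean.max.commits")                // preferred key after the rename
       .defaultValue("1")
       .withAlternatives("hoodie.cleaner.max.commits") // previous spelling, still accepted
       .markAdvanced()
       .withDocumentation("Number of commits after the last clean operation "
           + "before scheduling of a new clean is attempted.");
   ```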



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1529818335


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##
@@ -118,28 +120,32 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "the minimum number of file slices to retain in each file group, 
during cleaning.");
 
   public static final ConfigProperty CLEAN_TRIGGER_STRATEGY = 
ConfigProperty
-  .key("hoodie.clean.trigger.strategy")
+  .key("hoodie.cleaner.trigger.strategy")
   .defaultValue(CleaningTriggerStrategy.NUM_COMMITS.name())
+  .withAlternatives("hoodie.clean.trigger.strategy")
   .markAdvanced()
   .withDocumentation(CleaningTriggerStrategy.class);
 
   public static final ConfigProperty CLEAN_MAX_COMMITS = ConfigProperty
-  .key("hoodie.clean.max.commits")
+  .key("hoodie.cleaner.trigger.max.commits")
   .defaultValue("1")
+  .withAlternatives("hoodie.clean.max.commits")
   .markAdvanced()
   .withDocumentation("Number of commits after the last clean operation, 
before scheduling of a new clean is attempted.");
 
   public static final ConfigProperty CLEANER_INCREMENTAL_MODE_ENABLE = 
ConfigProperty
-  .key("hoodie.cleaner.incremental.mode")
+  .key("hoodie.cleaner.incremental.enabled")
   .defaultValue("true")
+  .withAlternatives("hoodie.cleaner.incremental.mode")
   .markAdvanced()
   .withDocumentation("When enabled, the plans for each cleaner service run 
is computed incrementally off the events "
   + "in the timeline, since the last cleaner run. This is much more 
efficient than obtaining listings for the full "
   + "table for each planning (even with a metadata table).");
 
   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = 
ConfigProperty
-  .key("hoodie.cleaner.policy.failed.writes")
+  .key("hoodie.cleaner.failed.writes.policy")

Review Comment:
   It makes sense. Changed all `cleaner` to `clean`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]

2024-03-18 Thread via GitHub


nsivabalan commented on PR #10877:
URL: https://github.com/apache/hudi/pull/10877#issuecomment-2005925712

   Yes, we might need to fix all bundles.
   @ad1happy2go: can you try out the patch with the Spark and utilities bundles and let us know if we are good here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10874:
URL: https://github.com/apache/hudi/pull/10874#issuecomment-2005869170

   
   ## CI report:
   
   * 46e36c45556766aea812b45b8f4fa7aec27e9bc0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22932)
 
   * 2b2014449a7254f2d4e733ea281b9b82bf2b5158 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22950)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005868894

   
   ## CI report:
   
   * 054e183e0ccfaf58d0034ea76c322b5f166859b8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22949)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10874:
URL: https://github.com/apache/hudi/pull/10874#issuecomment-2005849471

   
   ## CI report:
   
   * 46e36c45556766aea812b45b8f4fa7aec27e9bc0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22932)
 
   * 2b2014449a7254f2d4e733ea281b9b82bf2b5158 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1529748742


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##
@@ -118,28 +120,32 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "the minimum number of file slices to retain in each file group, 
during cleaning.");
 
   public static final ConfigProperty CLEAN_TRIGGER_STRATEGY = 
ConfigProperty
-  .key("hoodie.clean.trigger.strategy")
+  .key("hoodie.cleaner.trigger.strategy")
   .defaultValue(CleaningTriggerStrategy.NUM_COMMITS.name())
+  .withAlternatives("hoodie.clean.trigger.strategy")
   .markAdvanced()
   .withDocumentation(CleaningTriggerStrategy.class);
 
   public static final ConfigProperty CLEAN_MAX_COMMITS = ConfigProperty
-  .key("hoodie.clean.max.commits")
+  .key("hoodie.cleaner.trigger.max.commits")
   .defaultValue("1")
+  .withAlternatives("hoodie.clean.max.commits")
   .markAdvanced()
   .withDocumentation("Number of commits after the last clean operation, 
before scheduling of a new clean is attempted.");
 
   public static final ConfigProperty CLEANER_INCREMENTAL_MODE_ENABLE = 
ConfigProperty
-  .key("hoodie.cleaner.incremental.mode")
+  .key("hoodie.cleaner.incremental.enabled")
   .defaultValue("true")
+  .withAlternatives("hoodie.cleaner.incremental.mode")
   .markAdvanced()
   .withDocumentation("When enabled, the plans for each cleaner service run 
is computed incrementally off the events "
   + "in the timeline, since the last cleaner run. This is much more 
efficient than obtaining listings for the full "
   + "table for each planning (even with a metadata table).");
 
   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = 
ConfigProperty
-  .key("hoodie.cleaner.policy.failed.writes")
+  .key("hoodie.cleaner.failed.writes.policy")

Review Comment:
   Do we prefer `clean`, `cleaner`, or `cleaning`? Maybe just stay with `clean`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


suryaprasanna commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529733763


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instan
 writeClient.lazyRollbackFailedIndexing();
   }
 
-  /**
-   * Validates the timeline for both main and metadata tables to ensure 
compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option 
inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-// we need to find if there are any inflights in data table timeline 
before or equal to the latest delta commit in metadata table.
-// Whenever you want to change this logic, please ensure all below 
scenarios are considered.
-// a. There could be a chance that latest delta commit in MDT is committed 
in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-// b. There could be DT inflights after latest delta commit in MDT and we 
are ok with it. bcoz, the contract is, the latest compaction instant time in 
MDT represents
-// any instants before that is already synced with metadata table.
-// c. Do consider out of order commits. For eg, c4 from DT could complete 
before c3. and we can't trigger compaction in MDT with c4 as base instant time, 
until every
-// instant before c4 is synced with metadata table.
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-if (!pendingInstants.isEmpty()) {
-  checkNumDeltaCommits(metadataMetaClient, 
dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-  LOG.info(String.format(
-  "Cannot compact metadata table as there are %d inflight instants in 
data table before latest deltacommit in metadata table: %s. Inflight instants 
in data table: %s",
-  pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, 
Arrays.toString(pendingInstants.toArray(;
-  return false;
-}
-
-// Check if there are any pending compaction or log compaction instants in 
the timeline.
-// If pending compact/logCompaction operations are found abort scheduling 
new compaction/logCompaction operations.
-Option pendingLogCompactionInstant =

Review Comment:
   As discussed offline, please create a unit test around the inflight scenario and the necessary follow-up ticket to remove log compaction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529733125


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instan
 writeClient.lazyRollbackFailedIndexing();
   }
 
-  /**
-   * Validates the timeline for both main and metadata tables to ensure 
compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option 
inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-// we need to find if there are any inflights in data table timeline 
before or equal to the latest delta commit in metadata table.
-// Whenever you want to change this logic, please ensure all below 
scenarios are considered.
-// a. There could be a chance that latest delta commit in MDT is committed 
in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-// b. There could be DT inflights after latest delta commit in MDT and we 
are ok with it. bcoz, the contract is, the latest compaction instant time in 
MDT represents
-// any instants before that is already synced with metadata table.
-// c. Do consider out of order commits. For eg, c4 from DT could complete 
before c3. and we can't trigger compaction in MDT with c4 as base instant time, 
until every
-// instant before c4 is synced with metadata table.
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-if (!pendingInstants.isEmpty()) {
-  checkNumDeltaCommits(metadataMetaClient, 
dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-  LOG.info(String.format(
-  "Cannot compact metadata table as there are %d inflight instants in 
data table before latest deltacommit in metadata table: %s. Inflight instants 
in data table: %s",
-  pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, 
Arrays.toString(pendingInstants.toArray(;
-  return false;
-}
-
-// Check if there are any pending compaction or log compaction instants in 
the timeline.
-// If pending compact/logCompaction operations are found abort scheduling 
new compaction/logCompaction operations.
-Option pendingLogCompactionInstant =

Review Comment:
   Finally discussed offline with @suryaprasanna and we are convinced:
   
   1. The first part check for pending instants can be removed because we have 
NB-CC style file slicing, the instant time sequence does not really matter, 
only the completion time plays the role;
   2. The second part check can be limited to only the log compaction scope.
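   
   As a rough illustration of point 2 (a sketch under assumptions, not the actual patch;
   the timeline helpers used below are assumed to exist and may differ in the code), the
   remaining guard could be reduced to something like:
   
   ```java
   // Sketch only: limit the pre-scheduling guard to pending log compactions on the
   // MDT timeline, instead of also blocking on pending DT instants.
   Option<HoodieInstant> pendingLogCompaction = metadataMetaClient.getActiveTimeline()
       .filterPendingLogCompactionTimeline()
       .firstInstant();
   if (pendingLogCompaction.isPresent()) {
     LOG.info("Skipping scheduling: pending log compaction at instant "
         + pendingLogCompaction.get().getTimestamp());
     return false;
   }
   return true;
   ```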
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529733125


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instan
 writeClient.lazyRollbackFailedIndexing();
   }
 
-  /**
-   * Validates the timeline for both main and metadata tables to ensure 
compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option 
inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-// we need to find if there are any inflights in data table timeline 
before or equal to the latest delta commit in metadata table.
-// Whenever you want to change this logic, please ensure all below 
scenarios are considered.
-// a. There could be a chance that latest delta commit in MDT is committed 
in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-// b. There could be DT inflights after latest delta commit in MDT and we 
are ok with it. bcoz, the contract is, the latest compaction instant time in 
MDT represents
-// any instants before that is already synced with metadata table.
-// c. Do consider out of order commits. For eg, c4 from DT could complete 
before c3. and we can't trigger compaction in MDT with c4 as base instant time, 
until every
-// instant before c4 is synced with metadata table.
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-if (!pendingInstants.isEmpty()) {
-  checkNumDeltaCommits(metadataMetaClient, 
dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-  LOG.info(String.format(
-  "Cannot compact metadata table as there are %d inflight instants in 
data table before latest deltacommit in metadata table: %s. Inflight instants 
in data table: %s",
-  pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, 
Arrays.toString(pendingInstants.toArray(;
-  return false;
-}
-
-// Check if there are any pending compaction or log compaction instants in 
the timeline.
-// If pending compact/logCompaction operations are found abort scheduling 
new compaction/logCompaction operations.
-Option pendingLogCompactionInstant =

Review Comment:
   Finally discussed offline with @suryaprasanna and we are convinced:
   
   1. The first part check for pending instants can be removed because we have 
NB-CC style file slicing, the instant time sequence does not really matter, 
only the completion time plays the role;
   2. The second part check can be limited to only the log compaction scope.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005780416

   
   ## CI report:
   
   * d7e56c6660139f382830dc545ab703445fba9153 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22936)
 
   * 054e183e0ccfaf58d0034ea76c322b5f166859b8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22949)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005773466

   
   ## CI report:
   
   * d7e56c6660139f382830dc545ab703445fba9153 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22936)
 
   * 054e183e0ccfaf58d0034ea76c322b5f166859b8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10856:
URL: https://github.com/apache/hudi/pull/10856#discussion_r1529713723


##
website/docs/basic_configurations.md:
##
@@ -101,15 +102,14 @@ Flink jobs using the SQL can be configured through the options in WITH clause. T
 | [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name to register to Hive metastore `Config Param: DATABASE_NAME` |
 | [hoodie.table.name](#hoodietablename) | (N/A) | Table name to register to Hive metastore `Config Param: TABLE_NAME` |
 | [path](#path) | (N/A) | Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfully `Config Param: PATH` |
+| [read.commits.limit](#readcommitslimit) | (N/A) | The maximum number of commits allowed to read in each instant check, if it is streaming read, the avg read instants number per-second would be 'read.commits.limit'/'read.streaming.check-interval', by default no limit `Config Param: READ_COMMITS_LIMIT` |
 | [read.end-commit](#readend-commit) | (N/A) | End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss' `Config Param: READ_END_COMMIT` |
 | [read.start-commit](#readstart-commit) | (N/A) | Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read `Config Param: READ_START_COMMIT` |
 | [archive.max_commits](#archivemax_commits) | 50 | Max number of commits to keep before archiving older commits into a sequential log, default 50 `Config Param: ARCHIVE_MAX_COMMITS` |
 | [archive.min_commits](#archivemin_commits) | 40 | Min number of commits to keep before archiving older commits into a sequential log, default 40 `Config Param: ARCHIVE_MIN_COMMITS` |
 | [cdc.enabled](#cdcenabled) | false | When enabled, persist the change data if necessary, and can be queried as a CDC query mode `Config Param: CDC_ENABLED` |

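For context, these Flink options are supplied in the WITH clause of a Flink SQL table
definition. A minimal sketch of where an option such as `read.commits.limit` would go
(table name, schema, and path are made up; `tableEnv` is assumed to be an existing
Flink `TableEnvironment`):

```java
// Illustrative only: the WITH clause is where options from the table above are set.
tableEnv.executeSql(
    "CREATE TABLE hudi_source (uuid STRING, name STRING, ts TIMESTAMP(3)) WITH ("
        + " 'connector' = 'hudi',"
        + " 'path' = 'file:///tmp/hudi_table',"
        + " 'read.streaming.enabled' = 'true',"
        + " 'read.start-commit' = 'earliest',"
        + " 'read.commits.limit' = '10'"
        + ")");
```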

Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005765864

   
   ## CI report:
   
   * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


bhat-vinay commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1529707163


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
   QueryRunner queryRunner,
   CloudDataFetcher cloudDataFetcher) {
 super(props, sparkContext, sparkSession, schemaProvider);
+
+if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+  sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+  sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
   As I stated above, it did not work. It seems similar to issues like [this one](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios) mentioned online.
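   
   For readers following the thread, this is the per-read form of the same flags, as a
   sketch only (`sourcePath` is a placeholder); as noted above and in the linked Stack
   Overflow thread, these flags do not cover every failure scenario:
   
   ```java
   // Sketch: pass the flags as DataFrameReader options for a single read, instead of
   // (or in addition to) setting them on the whole SparkSession.
   Dataset<Row> df = sparkSession.read()
       .option("ignoreMissingFiles", "true")
       .option("ignoreCorruptFiles", "true")
       .parquet(sourcePath);
   ```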



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1529692618


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   Ok, I see your point. I modified the corresponding commit to keep the previous keys in `FlinkOptions` and add the new keys as fallback keys. Rebased on the current master and force-pushed.
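   
   A minimal sketch of that shape (key names illustrative; the real definitions live in
   `FlinkOptions`):
   
   ```java
   // Sketch only: the previous Flink key stays primary and the new Hoodie-style key is
   // accepted as a fallback, so jobs written against either spelling keep working.
   public static final ConfigOption<Boolean> CLEAN_ASYNC_ENABLED = ConfigOptions
       .key("clean.async.enabled")                     // existing key, kept as primary
       .booleanType()
       .defaultValue(true)
       .withFallbackKeys("hoodie.clean.async.enabled") // new-style key resolved as fallback
       .withDescription("Whether to cleanup the old commits immediately on new commits");
   ```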



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]

2024-03-18 Thread via GitHub


ad1happy2go commented on issue #10873:
URL: https://github.com/apache/hudi/issues/10873#issuecomment-2005742951

   Sorry @xuzifu666, I didn't understand. Are you saying it should archive or not?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005722576

   
   ## CI report:
   
   * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
 
   * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005717768

   
   ## CI report:
   
   * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
 
   * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]

2024-03-18 Thread via GitHub


CTTY commented on code in PR #10872:
URL: https://github.com/apache/hudi/pull/10872#discussion_r1529666064


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java:
##
@@ -57,6 +57,8 @@
  */
 public class JsonKafkaSource extends KafkaSource {
 
+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

Review Comment:
   Just curious: do we have to use a static `ObjectMapper` here, or is it fine to create it inside the iterator as well?
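   
   For reference, the two options being compared, as a standalone sketch (illustrative
   names, not the PR's code). `ObjectMapper` is thread-safe once configured, so a single
   shared instance avoids repeated construction cost; a per-iterator instance also works,
   at the price of building one mapper per partition:
   
   ```java
   import com.fasterxml.jackson.databind.JsonNode;
   import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class ObjectMapperChoices {
     // Option A: one shared mapper, built once and reused by every task/iterator.
     private static final ObjectMapper SHARED_MAPPER = new ObjectMapper();
   
     static JsonNode parseShared(String json) throws Exception {
       return SHARED_MAPPER.readTree(json);
     }
   
     // Option B: a mapper created inside the per-partition iterator; also correct,
     // but pays the ObjectMapper construction cost for every partition.
     static JsonNode parsePerPartition(String json) throws Exception {
       ObjectMapper localMapper = new ObjectMapper();
       return localMapper.readTree(json);
     }
   }
   ```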



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


wombatu-kun commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005715865

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on PR #10877:
URL: https://github.com/apache/hudi/pull/10877#issuecomment-2005707868

   @nsivabalan Can you check this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on issue #10881:
URL: https://github.com/apache/hudi/issues/10881#issuecomment-2005697002

   So can you read the parquet file correctly with the Spark native reader? Is there any error log from the writer while doing the archiving?
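   
   For anyone reproducing this, a quick native-reader check looks roughly like the sketch
   below (`spark` is an existing SparkSession and the path is a placeholder for the
   archived file reported above):
   
   ```java
   // Sketch: sanity-check the archived parquet file with Spark's built-in reader.
   Dataset<Row> archived = spark.read()
       .parquet("/path/to/table/.hoodie/archived/<archived-file>.parquet");
   archived.printSchema();
   System.out.println("row count = " + archived.count());
   ```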


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005678158

   
   ## CI report:
   
   * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
 
   * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]

2024-03-18 Thread via GitHub


xicm commented on issue #10881:
URL: https://github.com/apache/hudi/issues/10881#issuecomment-2005676734

   This is the parquet file 
   
![image](https://github.com/apache/hudi/assets/36392121/c772d5f6-0fb5-4f6c-b6d1-2f69c9a4fb25)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]

2024-03-18 Thread via GitHub


xicm opened a new issue, #10881:
URL: https://github.com/apache/hudi/issues/10881

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   ```
   24/03/19 10:58:34 WARN ProcessingTimeExecutor: Current batch is falling 
behind. The trigger interval is 1000 milliseconds, but spent 1141 milliseconds
   24/03/19 10:58:38 ERROR MicroBatchExecution: Query [id = 
3b464aa8-542d-4eea-b45a-4db8f2a8c8df, runId = 
d257c670-6a0c-49a8-af65-8cd40f32f1fc] terminated with error
   org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: unable to read next record from 
parquet file
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
at 
java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at 
org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:267)
at 
org.apache.hudi.common.table.timeline.CompletionTimeQueryView.load(CompletionTimeQueryView.java:323)
at 
org.apache.hudi.common.table.timeline.CompletionTimeQueryView.<init>(CompletionTimeQueryView.java:99)
at 
org.apache.hudi.common.table.timeline.CompletionTimeQueryView.<init>(CompletionTimeQueryView.java:84)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:121)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:115)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:109)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:100)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:179)
at 
org.apache.hudi.MergeOnReadIncrementalRelation.collectFileSplits(MergeOnReadIncrementalRelation.scala:110)
at 
org.apache.hudi.MergeOnReadIncrementalRelation.collectFileSplits(MergeOnReadIncrementalRelation.scala:46)
at 
org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:371)
at 
org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getBatch(HoodieStreamSource.scala:177)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$3(MicroBatchExecution.scala:549)
at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:27)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at 
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:27)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$2(MicroBatchExecution.scala:545)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:545)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:256

Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10866:
URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005673058

   
   ## CI report:
   
   * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
 
   * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


wombatu-kun commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529616940


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java:
##
@@ -266,9 +266,9 @@ public void setupTest() {
 
   protected static void 
populateInvalidTableConfigFilePathProps(TypedProperties props, String 
dfsBasePath) {
 props.setProperty("hoodie.datasource.write.keygenerator.class", 
TestHoodieDeltaStreamer.TestGenerator.class.getName());
-
props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
"MMdd");
-props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", 
"uber_db.dummy_table_uber");
-
props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile",
 dfsBasePath + "/config/invalid_uber_config.properties");
+props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", 
"MMdd");

Review Comment:
   Thank you, fixed it



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java:
##
@@ -279,10 +279,10 @@ protected static void 
populateAllCommonProps(TypedProperties props, String dfsBa
 
   protected static void populateCommonProps(TypedProperties props, String 
dfsBasePath) {
 props.setProperty("hoodie.datasource.write.keygenerator.class", 
TestHoodieDeltaStreamer.TestGenerator.class.getName());
-
props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
"MMdd");
-props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", 
"short_trip_db.dummy_table_short_trip,uber_db.dummy_table_uber");
-
props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile",
 dfsBasePath + "/config/uber_config.properties");
-
props.setProperty("hoodie.deltastreamer.ingestion.short_trip_db.dummy_table_short_trip.configFile",
 dfsBasePath + "/config/short_trip_uber_config.properties");
+props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", 
"MMdd");

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


wombatu-kun commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529616434


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/config/SourceTestConfig.java:
##
@@ -27,22 +27,22 @@
 public class SourceTestConfig {
 
   public static final ConfigProperty NUM_SOURCE_PARTITIONS_PROP = 
ConfigProperty
-  .key("hoodie.deltastreamer.source.test.num_partitions")
+  .key("hoodie.streamer.source.test.num_partitions")

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Change drop.partitionpath behavior [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on issue #10878:
URL: https://github.com/apache/hudi/issues/10878#issuecomment-2005652050

   > I went deeper and found that in Hudi 0.13.x and 0.14.x clustering do not 
honor flag hoodie.datasource.write.drop.partition.columns=true,
   
   Is this a bug?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529609990


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instan
 writeClient.lazyRollbackFailedIndexing();
   }
 
-  /**
-   * Validates the timeline for both main and metadata tables to ensure 
compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option 
inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-// we need to find if there are any inflights in data table timeline 
before or equal to the latest delta commit in metadata table.
-// Whenever you want to change this logic, please ensure all below 
scenarios are considered.
-// a. There could be a chance that latest delta commit in MDT is committed 
in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-// b. There could be DT inflights after latest delta commit in MDT and we 
are ok with it. bcoz, the contract is, the latest compaction instant time in 
MDT represents
-// any instants before that is already synced with metadata table.
-// c. Do consider out of order commits. For eg, c4 from DT could complete 
before c3. and we can't trigger compaction in MDT with c4 as base instant time, 
until every
-// instant before c4 is synced with metadata table.
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-if (!pendingInstants.isEmpty()) {
-  checkNumDeltaCommits(metadataMetaClient, 
dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-  LOG.info(String.format(
-  "Cannot compact metadata table as there are %d inflight instants in 
data table before latest deltacommit in metadata table: %s. Inflight instants 
in data table: %s",
-  pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, 
Arrays.toString(pendingInstants.toArray(;
-  return false;
-}
-
-// Check if there are any pending compaction or log compaction instants in 
the timeline.
-// If pending compact/logCompaction operations are found abort scheduling 
new compaction/logCompaction operations.
-Option pendingLogCompactionInstant =

Review Comment:
   > validateTimelineBeforeSchedulingCompaction method basically checks if 
there are inflight instants or pending compaction or logcompaction plan on the 
timeline. If any such instants are found it will stop compaction or 
logcompaction from happening.
   
   We can now lift this restriction because the MDT reader will filter out the invalid instants.
   
   > It also makes sure only one pending plan is present on the metadata 
table's timeline. That way it is easy to handle various cases on metadata table.
   
   Can it be in line with what we have for the DT? A pending compaction plan on the MDT would block all follow-up scheduling of compactions.
   
   > Pending logcompaction check present in HoodieLogCompactionPlanGenerator 
class, will exclude file groups that are currently under logcompaction, that 
means other file groups or partitions in metadata can still create another 
logcompaction plan.
   
   But there is no plan on a file slice that already has pending plans (compaction/log compaction), so the plan should still be valid?
   
   > One can argue that the, we should be able to schedule multiple 
logcompaction plans on the timeline, but we have seen operational issues when 
running logcompaction and compaction on metadata using table services. So, it 
is better keep these checks.
   
   I just want to make sure the strategy is as simple as possible and on par with the DT, so that the streaming use cases can also work well.
   
   cc @codope to help address some of these points.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


wombatu-kun commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529606592


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java:
##
@@ -73,12 +73,12 @@ public static void beforeAll() throws Exception {
   @BeforeEach
   public void setup() throws Exception {
 super.setup();
-PROPS.setProperty("hoodie.deltastreamer.jdbc.url", "jdbc:h2:mem:test_mem");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.driver.class", 
"org.h2.Driver");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.user", "test");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.password", "jdbc");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.table.name", "triprec");
-connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", 
"jdbc");
+PROPS.setProperty("hoodie.streamer.jdbc.url", "jdbc:h2:mem:test_mem");
+PROPS.setProperty("hoodie.streamer.jdbc.driver.class", "org.h2.Driver");
+PROPS.setProperty("hoodie.streamer.jdbc.user", "sa");
+PROPS.setProperty("hoodie.streamer.jdbc.password", "");
+PROPS.setProperty("hoodie.streamer.jdbc.table.name", "triprec");
+connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "");

Review Comment:
   Reverted, will do it in a separate PR.



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java:
##
@@ -438,7 +438,7 @@ public void testSourceWithStorageLevel() {
   private void writeSecretToFs() throws IOException {
 FileSystem fs = FileSystem.get(new Configuration());
 FSDataOutputStream outputStream = fs.create(new 
Path("file:///tmp/hudi/config/secret"));
-outputStream.writeBytes("jdbc");
+outputStream.writeBytes("");

Review Comment:
   Reverted, will do it in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


wombatu-kun commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529606423


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java:
##
@@ -2418,15 +2418,15 @@ public void testSqlSourceSource() throws Exception {
   @Disabled
   @Test
   public void testJdbcSourceIncrementalFetchInContinuousMode() {
-try (Connection connection = 
DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc")) {
+try (Connection connection = 
DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "")) {

Review Comment:
   Yes, it's from another fix; it's better to do it in a separate PR later. For now, I reverted the JDBC user name and password changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (b7ccecf3205 -> 7631e0dcb89)

2024-03-18 Thread stream2000
This is an automated email from the ASF dual-hosted git repository.

stream2000 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from b7ccecf3205 [HUDI-7492] Fix the incorrect keygenerator specification 
for multi partition or multi primary key tables creation (#10840)
 add 7631e0dcb89 [MINOR] Add Hudi icon for idea (#10880)

No new revisions were added by this update.

Summary of changes:
 .gitignore |   1 +
 .idea/icon.png | Bin 0 -> 14245 bytes
 2 files changed, 1 insertion(+)
 create mode 100644 .idea/icon.png



Re: [PR] [MINOR] Add Hudi icon for idea [hudi]

2024-03-18 Thread via GitHub


stream2000 merged PR #10880:
URL: https://github.com/apache/hudi/pull/10880


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add Hudi icon for idea [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10880:
URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005617775

   
   ## CI report:
   
   * 543cc32433495f4af12239307bf87a94dedacb94 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22946)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005612033

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * c15a59630c3d1def347ad50993177ce22adca790 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22945)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add Hudi icon for idea [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10880:
URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005612176

   
   ## CI report:
   
   * 543cc32433495f4af12239307bf87a94dedacb94 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add Hudi icon for idea [hudi]

2024-03-18 Thread via GitHub


stream2000 commented on PR #10880:
URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005612134

   Looks good, thanks for your contribution! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Add Hudi icon for idea [hudi]

2024-03-18 Thread via GitHub


qidian99 opened a new pull request, #10880:
URL: https://github.com/apache/hudi/pull/10880

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529562361


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1070,7 +1070,7 @@ public void update(HoodieRestoreMetadata restoreMetadata, 
String instantTime) {
   Map> partitionFilesToAdd = new HashMap<>();
   Map> partitionFilesToDelete = new HashMap<>();
   List partitionsToDelete = new ArrayList<>();
-  fetchOutofSyncFilesRecordsFromMetadataTable(dirInfoMap, 
partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete);
+  fetchAutofSyncFilesRecordsFromMetadataTable(dirInfoMap, 
partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete);

Review Comment:
   yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


suryaprasanna commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529421217


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1070,7 +1070,7 @@ public void update(HoodieRestoreMetadata restoreMetadata, 
String instantTime) {
   Map> partitionFilesToAdd = new HashMap<>();
   Map> partitionFilesToDelete = new HashMap<>();
   List partitionsToDelete = new ArrayList<>();
-  fetchOutofSyncFilesRecordsFromMetadataTable(dirInfoMap, 
partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete);
+  fetchAutofSyncFilesRecordsFromMetadataTable(dirInfoMap, 
partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete);

Review Comment:
   Is this a typo?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instan
 writeClient.lazyRollbackFailedIndexing();
   }
 
-  /**
-   * Validates the timeline for both main and metadata tables to ensure 
compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option 
inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-// we need to find if there are any inflights in data table timeline 
before or equal to the latest delta commit in metadata table.
-// Whenever you want to change this logic, please ensure all below 
scenarios are considered.
-// a. There could be a chance that latest delta commit in MDT is committed 
in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-// b. There could be DT inflights after latest delta commit in MDT and we 
are ok with it. bcoz, the contract is, the latest compaction instant time in 
MDT represents
-// any instants before that is already synced with metadata table.
-// c. Do consider out of order commits. For eg, c4 from DT could complete 
before c3. and we can't trigger compaction in MDT with c4 as base instant time, 
until every
-// instant before c4 is synced with metadata table.
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-if (!pendingInstants.isEmpty()) {
-  checkNumDeltaCommits(metadataMetaClient, 
dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-  LOG.info(String.format(
-  "Cannot compact metadata table as there are %d inflight instants in 
data table before latest deltacommit in metadata table: %s. Inflight instants 
in data table: %s",
-  pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, 
Arrays.toString(pendingInstants.toArray(;
-  return false;
-}
-
-// Check if there are any pending compaction or log compaction instants in 
the timeline.
-// If pending compact/logCompaction operations are found abort scheduling 
new compaction/logCompaction operations.
-Option pendingLogCompactionInstant =

Review Comment:
   validateTimelineBeforeSchedulingCompaction basically checks whether there are inflight instants or pending compaction or log compaction plans on the timeline. If any such instants are found, it stops a new compaction or log compaction from being scheduled.
   It also makes sure only one pending plan is present on the metadata table's timeline. That way it is easy to handle various cases on the metadata table.
   The pending log compaction check in the HoodieLogCompactionPlanGenerator class only excludes file groups that are currently under log compaction, which means other file groups or partitions in metadata can still create another log compaction plan. So the checks on pending compaction and log compaction instants are required. One can argue that we should be able to schedule multiple log compaction plans on the timeline, but we have seen operational issues when running log compaction and compaction on metadata using table services. So it is better to keep these checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1529560937


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1130,22 +1128,6 @@ public void update(HoodieRollbackMetadata 
rollbackMetadata, String instantTime)
 }
   }
 
-  protected void validateRollback(
-  String commitToRollbackInstantTime,
-  HoodieInstant compactionInstant,

Review Comment:
   We can do this now because the MOR rollback has been changed to be based on file deletion, similar to the COW table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1529551851


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.DummyActiveAction;
+import org.apache.hudi.client.timeline.LSMTimelineWriter;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.engine.LocalTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.testutils.HoodieTestTable;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test cases for {@link HoodieGlobalTimeline}.
+ */
+public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness {
+  @BeforeEach
+  public void setUp() throws Exception {
+initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+cleanMetaClient();
+  }
+
+  /**
+   * The test for checking whether an instant is archived.
+   */
+  @Test
+  void testArchivingCheck() throws Exception {
+writeArchivedTimeline(10, 1000, 50);
+writeActiveTimeline(1050, 10);
+HoodieGlobalTimeline globalTimeline = new 
HoodieGlobalTimeline(this.metaClient, Option.empty());
+assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant 
should be active");

Review Comment:
   Renamed `containsOrBeforeTimelineStarts` to `isValidInstant` and `isBeforeTimelineStarts` to `isArchived`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005552101

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * 948d9ecb6dc661628b787ba800756b78d52791af Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
 
   * c15a59630c3d1def347ad50993177ce22adca790 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22945)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828123#comment-17828123
 ] 

sivabalan narayanan edited comment on HUDI-7507 at 3/19/24 1:12 AM:


Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 


was (Author: shivnarayan):
Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requsted plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828142#comment-17828142
 ] 

sivabalan narayanan commented on HUDI-7507:
---

We have already fixed it w/ latest master (1.0) by generating the new commit 
times using locks. That should solve the issue. We can apply the same to 0.X 
branch. 
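
As a rough, self-contained illustration of "generating the new commit times using locks" (standalone sketch only, not Hudi code; real instant times are datetime-formatted strings and the lock would be the table-level lock provider):

```
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: generate a strictly increasing instant time and "schedule" it
// inside the same critical section, so a smaller timestamp can never show up later.
public class InstantSchedulingSketch {
  private final ReentrantLock tableLock = new ReentrantLock();
  private final AtomicLong latestInstant = new AtomicLong(0L);

  public long scheduleInstant() {
    tableLock.lock();                                    // 1. acquire table lock
    try {
      long latest = latestInstant.get();                 // 2. latest instant on the timeline (completed or not)
      long newTime = Math.max(System.currentTimeMillis(), latest + 1); // 3. timestamp strictly after it
      latestInstant.set(newTime);                        // 4. create the plan / .requested entry here
      return newTime;
    } finally {
      tableLock.unlock();                                // 5. release table lock
    }
  }
}
```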

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requsted plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005544453

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * 948d9ecb6dc661628b787ba800756b78d52791af Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
 
   * c15a59630c3d1def347ad50993177ce22adca790 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Spark snapshot query against MOR table data written by Flink gives an incorrect timestamp [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on issue #10879:
URL: https://github.com/apache/hudi/issues/10879#issuecomment-2005456434

   Did you specify the read options like `read.utc-timezone`? By default it is true, and recently we also added support for a write UTC timezone option in: https://github.com/apache/hudi/pull/10594
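   For reference, a rough sketch of pinning both options when defining the Flink table (the exact `write.utc-timezone` key is an assumption based on that PR and may differ by version):
   ```
import java.util.HashMap;
import java.util.Map;

// Illustrative only: connector options for a Flink Hudi MOR table.
public class HudiTimezoneOptions {
  public static Map<String, String> options() {
    Map<String, String> options = new HashMap<>();
    options.put("connector", "hudi");
    options.put("table.type", "MERGE_ON_READ");
    options.put("read.utc-timezone", "true");   // default is true: timestamps are interpreted as UTC on read
    options.put("write.utc-timezone", "true");  // assumed key from PR #10594; keep write and read consistent
    return options;
  }
}
   ```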


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1529509534


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   I kind of think the SQL options should not be too long, and we keep compatibility through the Hudi alternative keys.
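   For illustration, one way to keep a short SQL-facing key while still honoring a longer key through Flink's fallback-key mechanism (the fallback key name and description below are assumptions):
   ```
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Sketch: short key for SQL DDL, longer key accepted as a fallback for compatibility.
public class CleanOptionSketch {
  public static final ConfigOption<Boolean> CLEAN_ASYNC_ENABLED = ConfigOptions
      .key("clean.async.enabled")                       // short key used in SQL
      .booleanType()
      .defaultValue(true)
      .withFallbackKeys("hoodie.clean.async.enabled")   // assumed longer key kept for compatibility
      .withDescription("Whether to run cleaning asynchronously (placeholder description)");
}
   ```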



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


rmahindra123 commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1529460065


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:
##
@@ -90,8 +94,11 @@ public class UpsertPartitioner extends 
SparkHoodiePartitioner {
   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext 
context, HoodieTable table,

Review Comment:
   Should we add the implementation to a new class, maybe `SortedUpsertPartitioner` or something, so there is a clean separation? We can use the same config to control which one gets called.
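   For illustration, a minimal sketch of that separation (class and flag names are hypothetical, not existing Hudi code):
   ```
// Hypothetical sketch: a dedicated partitioner class chosen by a single config flag.
abstract class BucketAssigner {
  abstract int bucketFor(String recordKey, int numBuckets);
}

final class DefaultUpsertAssigner extends BucketAssigner {
  @Override
  int bucketFor(String recordKey, int numBuckets) {
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }
}

final class SortedUpsertAssigner extends BucketAssigner {
  // Same bucket routing; callers are expected to sort records by key within each bucket.
  @Override
  int bucketFor(String recordKey, int numBuckets) {
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }
}

final class AssignerFactory {
  // "sortInput" stands in for the config that decides which implementation is used.
  static BucketAssigner create(boolean sortInput) {
    return sortInput ? new SortedUpsertAssigner() : new DefaultUpsertAssigner();
  }
}
   ```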



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828123#comment-17828123
 ] 

sivabalan narayanan commented on HUDI-7507:
---

Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requsted plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10872:
URL: https://github.com/apache/hudi/pull/10872#issuecomment-2005124255

   
   ## CI report:
   
   * 629e91bc0267c0728b98326eb84072965c600205 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22928)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2005007838

   
   ## CI report:
   
   * 896491233f44039e8874d5a3080dd686fffd044e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22944)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10872:
URL: https://github.com/apache/hudi/pull/10872#issuecomment-2004996708

   
   ## CI report:
   
   * 629e91bc0267c0728b98326eb84072965c600205 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22928)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10872:
URL: https://github.com/apache/hudi/pull/10872#issuecomment-2004984687

   
   ## CI report:
   
   * 629e91bc0267c0728b98326eb84072965c600205 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Changing the Properties to Load From Both Default Path and Enviorment [hudi]

2024-03-18 Thread via GitHub


yihua commented on PR #10835:
URL: https://github.com/apache/hudi/pull/10835#issuecomment-2004976726

   @Amar1404 any update on the PR and @CTTY ’s suggestion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004847718

   
   ## CI report:
   
   * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943)
 
   * 896491233f44039e8874d5a3080dd686fffd044e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22944)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


yihua commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1529188995


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
   QueryRunner queryRunner,
   CloudDataFetcher cloudDataFetcher) {
 super(props, sparkContext, sparkSession, schemaProvider);
+
+if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+  sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+  sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
   See spark docs: 
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files
 `Spark allows you to use the configuration spark.sql.files.ignoreMissingFiles 
or the data source option ignoreMissingFiles to ignore missing files while 
reading data from files.`
   
   You need to set `.option("ignoreMissingFiles")` to achieve the behavior.
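   A small sketch of scoping it to a single read via the data source options instead of the session conf (path is a placeholder):
   ```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: per-read options, so other reads in the same session are unaffected.
public class IgnoreMissingFilesExample {
  public static Dataset<Row> read(SparkSession spark) {
    return spark.read()
        .format("parquet")
        .option("ignoreMissingFiles", "true")   // per-source option from the Spark docs above
        .option("ignoreCorruptFiles", "true")
        .load("s3a://bucket/prefix/*");         // placeholder path
  }
}
   ```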



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004824804

   
   ## CI report:
   
   * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943)
 
   * 896491233f44039e8874d5a3080dd686fffd044e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]

2024-03-18 Thread via GitHub


yihua commented on code in PR #10727:
URL: https://github.com/apache/hudi/pull/10727#discussion_r1529152799


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##
@@ -202,7 +202,9 @@ private Option> 
composeSchemaEvolutionTrans
   Schema newWriterSchema = 
AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
   Schema writeSchemaFromFile = 
AvroInternalSchemaConverter.convert(writeInternalSchema, 
newWriterSchema.getFullName());
   boolean needToReWriteRecord = sameCols.size() != 
colNamesFromWriteSchema.size()
-  || 
SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, 
writeSchemaFromFile).getType() == 
org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+  && 
SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, 
writeSchemaFromFile).getType()
+  == 
org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+

Review Comment:
   @xiarixiaoyao This info is valuable.  Basically using pruned schema to read 
Avro records is supported on Avro 1.10 and above, not on lower versions.  I see 
that Spark 3.2 and above and all Flink versions use Avro 1.10 and above.  So 
for these integrations and others that rely on Avro 1.10 and above, we should 
use pruned schema to read log records to improve performance.  I'll check the 
new file group reader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]

2024-03-18 Thread via GitHub


yihua commented on PR #9717:
URL: https://github.com/apache/hudi/pull/9717#issuecomment-2004747912

   > @yihua : We need to use Hudi with Spark 3.5. Can you let me know when is 
Hudi 0.15.0 release planned?
   
   The 0.15.0 release branch is planned to be cut this month once we verify 
engine integrations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


yihua commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529100966


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java:
##
@@ -2418,15 +2418,15 @@ public void testSqlSourceSource() throws Exception {
   @Disabled
   @Test
   public void testJdbcSourceIncrementalFetchInContinuousMode() {
-try (Connection connection = 
DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc")) {
+try (Connection connection = 
DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "")) {

Review Comment:
   Looks like this is from another fix.  Could the JDBC user and password be 
put into static variables to avoid any misconfigs in tests in the future?
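   Something along these lines, for example (names are illustrative):
   ```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch: one place for the test JDBC credentials so tests cannot drift apart.
final class JdbcTestConstants {
  static final String JDBC_URL = "jdbc:h2:mem:test_mem";
  static final String JDBC_USER = "sa";
  static final String JDBC_PASSWORD = "";

  static Connection openConnection() throws SQLException {
    return DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASSWORD);
  }
}
   ```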



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java:
##
@@ -438,7 +438,7 @@ public void testSourceWithStorageLevel() {
   private void writeSecretToFs() throws IOException {
 FileSystem fs = FileSystem.get(new Configuration());
 FSDataOutputStream outputStream = fs.create(new 
Path("file:///tmp/hudi/config/secret"));
-outputStream.writeBytes("jdbc");
+outputStream.writeBytes("");

Review Comment:
   Similar here for the password.  And should we add the password back for 
testing the secret (this refactoring of static variables and password should be 
in a separate PR)?



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java:
##
@@ -73,12 +73,12 @@ public static void beforeAll() throws Exception {
   @BeforeEach
   public void setup() throws Exception {
 super.setup();
-PROPS.setProperty("hoodie.deltastreamer.jdbc.url", "jdbc:h2:mem:test_mem");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.driver.class", 
"org.h2.Driver");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.user", "test");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.password", "jdbc");
-PROPS.setProperty("hoodie.deltastreamer.jdbc.table.name", "triprec");
-connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", 
"jdbc");
+PROPS.setProperty("hoodie.streamer.jdbc.url", "jdbc:h2:mem:test_mem");
+PROPS.setProperty("hoodie.streamer.jdbc.driver.class", "org.h2.Driver");
+PROPS.setProperty("hoodie.streamer.jdbc.user", "sa");
+PROPS.setProperty("hoodie.streamer.jdbc.password", "");
+PROPS.setProperty("hoodie.streamer.jdbc.table.name", "triprec");
+connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "");

Review Comment:
   Similar here for the JDBC user name and password.



##
hudi-utilities/src/test/resources/streamer-config/invalid_hive_sync_uber_config.properties:
##
@@ -18,6 +18,6 @@
 include=base.properties
 hoodie.datasource.write.recordkey.field=_row_key
 hoodie.datasource.write.partitionpath.field=created_at
-hoodie.deltastreamer.source.kafka.topic=test_topic
-hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd
\ No newline at end of file
+hoodie.streamer.source.kafka.topic=test_topic
+hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP

Review Comment:
   reminder for fixing keygen config.



##
hudi-utilities/src/test/resources/streamer-config/uber_config.properties:
##
@@ -18,10 +18,10 @@
 include=base.properties
 hoodie.datasource.write.recordkey.field=_row_key
 hoodie.datasource.write.partitionpath.field=created_at
-hoodie.deltastreamer.source.kafka.topic=topic1
-hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss.S
+hoodie.streamer.source.kafka.topic=topic1
+hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP

Review Comment:
   reminder for fixing keygen config.



##
hudi-utilities/src/test/resources/streamer-config/short_trip_uber_config.properties:
##
@@ -18,11 +18,11 @@
 include=base.properties
 hoodie.datasource.write.recordkey.field=_row_key
 hoodie.datasource.write.partitionpath.field=created_at
-hoodie.deltastreamer.source.kafka.topic=topic2
-hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss.S
+hoodie.streamer.source.kafka.topic=topic2
+hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP

Review Comment:
   reminder for fixing keygen config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]

2024-03-18 Thread via GitHub


yihua commented on code in PR #10866:
URL: https://github.com/apache/hudi/pull/10866#discussion_r1529083756


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/config/SourceTestConfig.java:
##
@@ -27,22 +27,22 @@
 public class SourceTestConfig {
 
   public static final ConfigProperty NUM_SOURCE_PARTITIONS_PROP = 
ConfigProperty
-  .key("hoodie.deltastreamer.source.test.num_partitions")
+  .key("hoodie.streamer.source.test.num_partitions")

Review Comment:
   Let's add `.withAlternatives()` to preserve the old config naming to be 
backward compatible and use `STREAMER_CONFIG_PREFIX` and 
`DELTA_STREAMER_CONFIG_PREFIX`.  Similar for other configs in this class, 
`SourceTestConfig`.  See `HoodieStreamerConfig.CHECKPOINT_PROVIDER_PATH` for 
reference:
   ```
 public static final ConfigProperty CHECKPOINT_PROVIDER_PATH = 
ConfigProperty
 .key(STREAMER_CONFIG_PREFIX + "checkpoint.provider.path")
 .noDefaultValue()
 .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + 
"checkpoint.provider.path")
 .markAdvanced()
 .withDocumentation("The path for providing the checkpoints.");
   ```
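   Applied to `SourceTestConfig`, that could look roughly like this (the type parameter, default value, and documentation text are placeholders):
   ```
  public static final ConfigProperty<Integer> NUM_SOURCE_PARTITIONS_PROP = ConfigProperty
      .key("hoodie.streamer.source.test.num_partitions")
      .defaultValue(10)
      .withAlternatives("hoodie.deltastreamer.source.test.num_partitions")
      .markAdvanced()
      .withDocumentation("Number of partitions generated by the test source.");
   ```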



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java:
##
@@ -266,9 +266,9 @@ public void setupTest() {
 
   protected static void 
populateInvalidTableConfigFilePathProps(TypedProperties props, String 
dfsBasePath) {
 props.setProperty("hoodie.datasource.write.keygenerator.class", 
TestHoodieDeltaStreamer.TestGenerator.class.getName());
-
props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
"MMdd");
-props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", 
"uber_db.dummy_table_uber");
-
props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile",
 dfsBasePath + "/config/invalid_uber_config.properties");
+props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", 
"MMdd");

Review Comment:
   For key generator-related configs, the config prefix is renamed from 
`hoodie.deltastreamer.keygen.timebased.` to `hoodie.keygen.timebased.` (see 
class `TimestampKeyGeneratorConfig`).
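   For these tests, a corrected setup would then look roughly like this (date pattern and timestamp type taken from the surrounding snippets):
   ```
import java.util.Properties;

// Sketch only: keygen configs now live under the "hoodie.keygen.timebased." prefix.
public class KeygenPrefixExample {
  public static Properties keygenProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd");
    props.setProperty("hoodie.keygen.timebased.timestamp.type", "UNIX_TIMESTAMP");
    return props;
  }
}
   ```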



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java:
##
@@ -279,10 +279,10 @@ protected static void 
populateAllCommonProps(TypedProperties props, String dfsBa
 
   protected static void populateCommonProps(TypedProperties props, String 
dfsBasePath) {
 props.setProperty("hoodie.datasource.write.keygenerator.class", 
TestHoodieDeltaStreamer.TestGenerator.class.getName());
-
props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
"MMdd");
-props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", 
"short_trip_db.dummy_table_short_trip,uber_db.dummy_table_uber");
-
props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile",
 dfsBasePath + "/config/uber_config.properties");
-
props.setProperty("hoodie.deltastreamer.ingestion.short_trip_db.dummy_table_short_trip.configFile",
 dfsBasePath + "/config/short_trip_uber_config.properties");
+props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", 
"MMdd");

Review Comment:
   Similar here and other places for key generator configs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-18 Thread via GitHub


vinothchandar commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1529092922


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.DummyActiveAction;
+import org.apache.hudi.client.timeline.LSMTimelineWriter;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.engine.LocalTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.testutils.HoodieTestTable;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test cases for {@link HoodieGlobalTimeline}.
+ */
+public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness {
+  @BeforeEach
+  public void setUp() throws Exception {
+initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+cleanMetaClient();
+  }
+
+  /**
+   * The test for checking whether an instant is archived.
+   */
+  @Test
+  void testArchivingCheck() throws Exception {
+writeArchivedTimeline(10, 1000, 50);
+writeActiveTimeline(1050, 10);
+HoodieGlobalTimeline globalTimeline = new 
HoodieGlobalTimeline(this.metaClient, Option.empty());
+assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant 
should be active");

Review Comment:
   works for now
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation [hudi]

2024-03-18 Thread via GitHub


yihua commented on code in PR #10778:
URL: https://github.com/apache/hudi/pull/10778#discussion_r1529060885


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala:
##
@@ -58,9 +59,19 @@ object AvroConversionUtils {
*/
   def createInternalRowToAvroConverter(rootCatalystType: StructType, 
rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = {
 val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, 
rootAvroType, nullable)
-row => serializer
-  .serialize(row)
-  .asInstanceOf[GenericRecord]
+row => {
+  try {
+serializer
+  .serialize(row)
+  .asInstanceOf[GenericRecord]
+  } catch {
+case e: HoodieSchemaException => throw e
+case e => throw new SchemaCompatibilityException("Failed to convert 
spark record into avro record", e)
+  }
+

Review Comment:
   nit: remove empty line



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala:
##
@@ -58,9 +59,19 @@ object AvroConversionUtils {
*/
   def createInternalRowToAvroConverter(rootCatalystType: StructType, 
rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = {
 val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, 
rootAvroType, nullable)
-row => serializer
-  .serialize(row)
-  .asInstanceOf[GenericRecord]
+row => {
+  try {
+serializer
+  .serialize(row)
+  .asInstanceOf[GenericRecord]
+  } catch {
+case e: HoodieSchemaException => throw e
+case e => throw new SchemaCompatibilityException("Failed to convert 
spark record into avro record", e)
+  }
+
+}
+
+

Review Comment:
   Nit: remove two empty lines



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java:
##
@@ -159,7 +156,16 @@ public static Option> 
createHoodieRecords(HoodieStreamer.C
* @return the representation of error record (empty {@link HoodieRecord} 
and the error record
* String) for writing to error table.
*/
-  private static Either 
generateErrorRecord(GenericRecord genRec) {
+  private static Either 
generateErrorRecordOrThrowException(GenericRecord genRec, Exception e, boolean 
shouldErrorTable) {
+if (!shouldErrorTable) {
+  if (e instanceof HoodieKeyException) {
+throw (HoodieKeyException) e;
+  } else if (e instanceof HoodieKeyGeneratorException) {
+throw (HoodieKeyGeneratorException) e;

Review Comment:
   Nit: could the `if` and `else if` be merged like before?  Also, there is no 
need to do type cast here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004464696

   
   ## CI report:
   
   * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Spark snapshot query against MOR table data written by Flink gives an incorrect timestamp [hudi]

2024-03-18 Thread via GitHub


dderjugin opened a new issue, #10879:
URL: https://github.com/apache/hudi/issues/10879

   I'm writing data using the Flink DataStream API to a MoR table, partitioned by a column of Long type, with a Timestamp field as the PK.
   A Spark SNAPSHOT query gives an incorrect value for the latest records: +21971-04-23 04:46:37 instead of 1990-01-01 07:51:09.997.
   A read-optimized query gives the correct result.
   
   **To Reproduce**
   
   1. Push data to MOR table using Flink 1.17.2 DataStream API. PK is 
timestamp, partitioning field is Long.
   2. Query the table data using Spark snapshot query: "select count(*), 
min(ts), max(ts) from [table]"
   3. The max value of "ts" column is incorrect: +21971-04-23 04:46:37
   
   **Expected behavior**
   Max "ts" column value should be "1990-01-01 07:51:09.997"
   
   Examples:
   - correct result using read optimized query:
   +--------+-------------------+-----------------------+
   |count(1)|min(ts)            |max(ts)                |
   +--------+-------------------+-----------------------+
   |5166373 |1989-12-31 23:59:50|1990-01-01 05:47:39.998|
   +--------+-------------------+-----------------------+
   
   - incorrect result using snapshot query:
   +--------+-------------------+---------------------+
   |count(1)|min(ts)            |max(ts)              |
   +--------+-------------------+---------------------+
   |6033156 |1989-12-31 23:59:50|+21971-02-25 05:42:13|
   +--------+-------------------+---------------------+
   
   A detailed query shows that only the PK column value is incorrect and the rest of the table columns have the expected values.
   Data in the corresponding parquet file looks correct for all columns, including the PK column.
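
   The magnitude of the wrong value is consistent with an epoch stored in microseconds being re-read as milliseconds. A minimal, hypothetical Java sketch of just that unit mix-up (not Hudi or Spark code, only the arithmetic):

   ```java
   import java.time.Instant;
   import java.time.ZoneOffset;

   public class TimestampUnitMixUp {
     public static void main(String[] args) {
       // 1990-01-01T00:00:00Z expressed as microseconds since the epoch
       long micros = Instant.parse("1990-01-01T00:00:00Z").toEpochMilli() * 1000L; // 631_152_000_000_000

       // Re-reading that microsecond value as milliseconds lands around year 21970,
       // the same order of magnitude as the "+21971-..." values reported above.
       Instant misread = Instant.ofEpochMilli(micros);
       System.out.println(misread.atZone(ZoneOffset.UTC).getYear());
     }
   }
   ```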
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.4.2
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : local filesystem
   
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]

2024-03-18 Thread via GitHub


xuzifu666 commented on issue #10873:
URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004342507

   > No, it's not archiving. But why do you think they should be archived? All these commits are still valid and should be readable in this case, so they should remain active.
   
   Yes, it should not be archived.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004326001

   
   ## CI report:
   
   * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942)
 
   * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004163320

   
   ## CI report:
   
   * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942)
 
   * d934cf86d5375c3dec64f0bb171f455aa488e7fc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004144844

   
   ## CI report:
   
   * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]

2024-03-18 Thread via GitHub


ad1happy2go commented on issue #10873:
URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004127645

   No, it's not archiving. But why do you think they should be archived? All these commits are still valid and should be readable in this case, so they should remain active.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]

2024-03-18 Thread via GitHub


xuzifu666 commented on issue #10873:
URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004120534

   > yes, there was no commit being archived.
   
   Hi @ad1happy2go, is this case also archived?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]

2024-03-18 Thread via GitHub


ad1happy2go commented on issue #10873:
URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004095974

   @xuzifu666 In case you mean updating the above code like this -
   
   ```
   for i in range(1, 100):
       df = df.withColumn("city", lit(i))
       (df.write.format("hudi").options(**hudi_options).mode("append").save(PATH))
   ```
   
   With the above change, yes, there was no commit being archived. But I am wondering why they would even be archived, as all the partition data will still be valid and all the commits are valid and should stay active.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


vinishjail97 commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004067378

   > Is it possible to get a test coverage for the source classes touched and 
share the report.
   > 
   > Generally, GcsEventsHoodieIncrSource and S3EventsHoodieIncrSource does not 
have any function tests right. so, we rely heavily on mocked UTs.
   
   GcsEventsHoodieIncrSource, S3EventsHoodieIncrSource, CloudDataFetcher and 
CloudObjectsSelectorCommon have decent coverage 
   
   https://github.com/apache/hudi/assets/16958856/b98d1b24-9598-4cb9-b9f7-53e473cbef77
   
   https://github.com/apache/hudi/assets/16958856/56c013a0-b0f1-4ce2-be5d-d992b207f6b5
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10861:
URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004035486

   
   ## CI report:
   
   * c2f8413cf59f8dd3d14ab3bc638cf7e2eb989d80 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22896)
 
   * 8a6413f86714acd2f2f31a66af42b66526c23665 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10865:
URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003639403

   
   ## CI report:
   
   * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22940)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Change drop.partitionpath behavior [hudi]

2024-03-18 Thread via GitHub


VitoMakarevich opened a new issue, #10878:
URL: https://github.com/apache/hudi/issues/10878

   **Describe the problem you faced**
   
   Hudi 0.12.2 version.
   Hello, we are currently using Hudi COW with `hoodie.datasource.write.drop.partition.columns=true` (non-default), and recently we wanted to run clustering for this dataset. The problem we observe is that the `bulk_insert` used in clustering does not honor this setting, and `parquet` files are created with the partition columns. As far as I checked it's not a concern for reading via Spark, but it causes problems when upserting into those partitions where clustering has run. If an update comes to such a file, in 0.12.x we see failures like `Parquet/Avro schema mismatch: Avro field 'partition1' not found`.
   [Reproduction 
here](https://github.com/VitoMakarevich/hudi-incremental-issue/blob/master/src/main/scala/com/example/hudi/HudiClusteringPartitionColumn.scala)
   I went deeper and found that in Hudi 0.13.x and 0.14.x clustering does not honor the flag `hoodie.datasource.write.drop.partition.columns=true` either, but there it does not cause problems with upsert - partitions can be updated.
   Given this, we probably cannot upgrade now, but we would like to execute clustering. So I need help with the following questions:
   1. Is it safe to change `hoodie.datasource.write.drop.partition.columns` from `true` to `false`? As far as I have observed it looks to be working, but I am not certain.
   2. How can it be changed? Changing it in the write options does not help, because it is written into the `hoodie.properties` file. If I change the contents of this file, it works this way: a subsequent update appends the partition column value, but only for changed rows; old rows remain `null`. However, in the read path, old rows have this field populated (some magic). And once clustering is run, all rows have the field populated.
   Important: we would likely not be able to run clustering on the whole table at once, so if we run it in batches, is it safe to have some files include the partition columns and some not?
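
   For reference, a minimal sketch of how the flag is assumed to be passed on the write path (illustrative Java only; the table name, column names and paths are placeholders, not taken from the reproduction code):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;

   public class DropPartitionColumnsSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("drop-partition-columns-sketch")
           .master("local[1]")
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .getOrCreate();

       Dataset<Row> df = spark.read().parquet("/tmp/input");

       df.write().format("hudi")
           .option("hoodie.table.name", "demo_tbl")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.partitionpath.field", "partition1")
           // the flag in question; once the table exists it is also recorded in hoodie.properties
           .option("hoodie.datasource.write.drop.partition.columns", "true")
           .mode(SaveMode.Append)
           .save("/tmp/hudi/demo_tbl");

       spark.stop();
     }
   }
   ```

   The sketch only shows where the option is passed on a write; as noted above, the value already recorded in `hoodie.properties` is what governs the existing table, which is exactly what question 2 is about.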
   
   Nice-to-answer question:
   1. Do you know what might cause the change of behavior from 0.12.3 to 0.13.0, such that the write works after clustering?
   
   **To Reproduce**
   
   [Reproduction 
here](https://github.com/VitoMakarevich/hudi-incremental-issue/blob/master/src/main/scala/com/example/hudi/HudiClusteringPartitionColumn.scala)
   
   
   **Environment Description**
   
   * Hudi version : 0.12.2 - 0.14.x
   
   * Spark version : 3.3.0
   
   
   
   **Stacktrace**
   
   ```
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
to stage failure: Task 0 in stage 128.0 failed 1 times, most recent failure: 
Lost task 0.0 in stage 128.0 (TID 531) (192.168.0.143 executor driver): 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :0
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)

Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003610845

   
   ## CI report:
   
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


bhat-vinay commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1528192748


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
   QueryRunner queryRunner,
   CloudDataFetcher cloudDataFetcher) {
 super(props, sparkContext, sparkSession, schemaProvider);
+
+if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+  sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+  sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
   @nsivabalan  Almost all 
[online](https://www.waitingforcode.com/apache-spark-sql/ignoring-files-issues-apache-spark-sql/read)
 
[resources](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios)
 that I found on the usage of `spark.sql.files.ignoreMissingFiles` say that it 
needs to be set at the `SparkSession::conf()` level. 
   
   I did try setting it through the `DataFrameReader::option` using 
`spark.read().format(fileFormat).option("spark.sql.files.ignoreMissingFiles", 
"true")`, but the `reader.load(...)` fails with `FileNotFoundException` for 
missing files. Can trigger this behavior with the added tests (by commenting 
out `sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true")` in 
`TestCloudObjectsSelectorCommon::ignoreMissingFiles(...)`). Debugger does show 
that the option is set in the `DataFrameReader` object, but it does not seem to 
impact the outcome. Seems similar to [this 
issue](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios)
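
   For illustration only, a hedged Java sketch of the two places the flag can be set (the session-level form is what the PR diff uses; as described above, the reader-level option did not avoid the `FileNotFoundException` in this code path, and the stack trace below is from that failure):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class IgnoreMissingFilesSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();

       // Session-level settings, as in the PR diff: apply to all subsequent file scans.
       spark.conf().set("spark.sql.files.ignoreMissingFiles", "true");
       spark.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

       // Reader-level option (one of the variants tried above); per the comment it
       // did not take effect in this code path. The path is a placeholder.
       Dataset<Row> df = spark.read()
           .format("parquet")
           .option("ignoreMissingFiles", "true")
           .load("/tmp/some/input/path");

       df.show();
       spark.stop();
     }
   }
   ```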
   
   ```
   ..
   ..
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:3131)
at 
org.apache.hudi.utilities.sources.helpers.TestCloudObjectsSelectorCommon.ignoreMissingFiles(TestCloudObjectsSelectorCommon.java:173)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
at 
org.junit.p

Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003452778

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10865:
URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003452570

   
   ## CI report:
   
   * 870069b52e67240b93cbfa9e7184eb933f296ff6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22905)
 
   * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22940)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10865:
URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003422476

   
   ## CI report:
   
   * 870069b52e67240b93cbfa9e7184eb933f296ff6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22905)
 
   * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003422824

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

2024-03-18 Thread via GitHub


bhat-vinay commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1528192748


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
   QueryRunner queryRunner,
   CloudDataFetcher cloudDataFetcher) {
 super(props, sparkContext, sparkSession, schemaProvider);
+
+if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+  sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+  sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
   @nsivabalan  Almost all 
[online](https://www.waitingforcode.com/apache-spark-sql/ignoring-files-issues-apache-spark-sql/read)
 
[resources](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios)
 that I found on the usage of `spark.sql.files.ignoreMissingFiles` say that it 
needs to be set at the `SparkSession::conf()` level. 
   
   I did try setting it through the `DataFrameReader::option` using 
`spark.read().format(fileFormat).option("ignoreMissingFiles", "true")`, but the 
`reader.load(...)` fails with `FileNotFoundException` for missing files. Can 
trigger this behavior with the added tests (by commenting out 
`sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true")` in 
`TestCloudObjectsSelectorCommon::ignoreMissingFiles(...)`). Debugger does show 
that the option is set in the `DataFrameReader` object, but it does not seem to 
impact the outcome. Seems similar to [this 
issue](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios)
   
   ```
   ..
   ..
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:3131)
at 
org.apache.hudi.utilities.sources.helpers.TestCloudObjectsSelectorCommon.ignoreMissingFiles(TestCloudObjectsSelectorCommon.java:173)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
at 
org.junit.platform.engine.s

Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi internal options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Hudi internal name for the same option.
   
   Could we create a naming convention for options? It's a nightmare, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options now. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi internal options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Spark or Hudi internal name for the same option.
   
   Could we create a naming convention for options? It's a nightmare, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options now. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi kernel options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Spark or Hudi internal name for the same option.
   
   Could we create a naming convention for options? It's a nightmare, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options now. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi kernel options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Spark or Hudi internal name for the same option.
   
   Could we create a naming convention for options? It's a nightmare now, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi kernel options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Spark or kernel name for the same option.
   
   Could we create a naming convention for options? It's a nightmare now, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528087176


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   There are some parameters without shortcuts in `FlinkOptions`:
   ```
   hoodie.bucket.index.hash.field
   hoodie.bucket.index.num.buckets
   hoodie.database.name
   hoodie.datasource.merge.type
   hoodie.datasource.query.type
   hoodie.datasource.write.hive_style_partitioning
   hoodie.datasource.write.keygenerator.class
   hoodie.datasource.write.keygenerator.type
   hoodie.datasource.write.partitionpath.field
   hoodie.datasource.write.partitionpath.urlencode
   hoodie.datasource.write.recordkey.field
   hoodie.index.bucket.engine
   hoodie.table.name
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


geserdugarov commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   @danny0405, if it doesn't bother you, could you please clarify why Flink options differ from the similar Hudi kernel options? From a user's point of view it's an unnecessary complication: you need to look up the Flink variation of the naming even if you already know the Spark or kernel name for the same option.
   
   Could we create a naming convention for options? It's a nightmare now, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738): we have 900+ options.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10877:
URL: https://github.com/apache/hudi/pull/10877#discussion_r1528081859


##
pom.xml:
##
@@ -482,6 +482,7 @@
   org.apache.htrace:htrace-core4
   
   
com.fasterxml.jackson.module:jackson-module-afterburner
+  
com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}
   

Review Comment:
   cc @nsivabalan do we need to check other jackson related jars?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-18 Thread via GitHub


danny0405 commented on code in PR #10851:
URL: https://github.com/apache/hudi/pull/10851#discussion_r1528063145


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -725,40 +725,45 @@ private FlinkOptions() {
   .withDescription("Target IO in MB for per compaction (both read and 
write), default 500 GB");
 
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-  .key("clean.async.enabled")
+  .key("hoodie.cleaner.async.enabled")

Review Comment:
   `clean.async.enabled` is a shortcut option specifically for Flink; `hoodie.cleaner.async.enabled` should be the fallback key instead.
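
   A minimal sketch of that suggestion (hypothetical code, assuming the Flink `ConfigOptions` builder and `withFallbackKeys`; this is not the actual FlinkOptions source):

   ```java
   import org.apache.flink.configuration.ConfigOption;
   import org.apache.flink.configuration.ConfigOptions;

   // Keep the short Flink-specific key as the primary key and resolve the
   // hoodie.* key only as a fallback, instead of renaming the option.
   public final class FlinkCleanOptionSketch {
     public static final ConfigOption<Boolean> CLEAN_ASYNC_ENABLED = ConfigOptions
         .key("clean.async.enabled")                         // shortcut users set in the WITH (...) clause
         .booleanType()
         .defaultValue(true)
         .withFallbackKeys("hoodie.cleaner.async.enabled");  // hoodie.* key resolved as a fallback

     private FlinkCleanOptionSketch() {
     }
   }
   ```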



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7492) When using Flinkcatalog to create hudi multiple partitions or multiple primary keys, the keygenerator generation is incorrect

2024-03-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7492.

Resolution: Fixed

Fixed via master branch: b7ccecf32051e1f880826ee21a4b97cfd3a06f87

> When using Flinkcatalog to create hudi multiple partitions or multiple 
> primary keys, the keygenerator generation is incorrect
> -
>
> Key: HUDI-7492
> URL: https://issues.apache.org/jira/browse/HUDI-7492
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When using Flinkcatalog to create hudi multiple partitions or multiple 
> primary keys, the keygenerator generation is incorrect.
> {code:sql}
> create table if not exists hudi_catalog.source1.hudi_key_generator_test
> (
> id   int,
> name varchar,
> age  varchar,
> sex  int,
> primary key (id) not enforced
> ) partitioned by (`sex`, `age`) with (
>   'connector' = 'hudi'
> );
> {code}
> .hoodie/hoodie.properties
> {panel:title=hoodie.properties}
> hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleAvroKeyGenerator
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7492) When using Flinkcatalog to create hudi multiple partitions or multiple primary keys, the keygenerator generation is incorrect

2024-03-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7492:
-
Fix Version/s: 0.15.0
   1.0.0

> When using Flinkcatalog to create hudi multiple partitions or multiple 
> primary keys, the keygenerator generation is incorrect
> -
>
> Key: HUDI-7492
> URL: https://issues.apache.org/jira/browse/HUDI-7492
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When using Flinkcatalog to create hudi multiple partitions or multiple 
> primary keys, the keygenerator generation is incorrect.
> {code:sql}
> create table if not exists hudi_catalog.source1.hudi_key_generator_test
> (
> id   int,
> name varchar,
> age  varchar,
> sex  int,
> primary key (id) not enforced
> ) partitioned by (`sex`, `age`) with (
>   'connector' = 'hudi'
> );
> {code}
> .hoodie/hoodie.properties
> {panel:title=hoodie.properties}
> hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleAvroKeyGenerator
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7492] Fix the incorrect keygenerator specification for multi partition or multi primary key tables creation (#10840)

2024-03-18 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b7ccecf3205 [HUDI-7492] Fix the incorrect keygenerator specification 
for multi partition or multi primary key tables creation (#10840)
b7ccecf3205 is described below

commit b7ccecf32051e1f880826ee21a4b97cfd3a06f87
Author: empcl <1515827...@qq.com>
AuthorDate: Mon Mar 18 16:27:09 2024 +0800

[HUDI-7492] Fix the incorrect keygenerator specification for multi 
partition or multi primary key tables creation (#10840)
---
 .../org/apache/hudi/table/HoodieTableFactory.java  |  7 +---
 .../apache/hudi/table/catalog/HoodieCatalog.java   |  4 ++
 .../hudi/table/catalog/HoodieHiveCatalog.java  |  3 ++
 .../java/org/apache/hudi/util/StreamerUtil.java| 12 ++
 .../hudi/table/catalog/TestHoodieCatalog.java  | 43 +
 .../hudi/table/catalog/TestHoodieHiveCatalog.java  | 45 ++
 6 files changed, 108 insertions(+), 6 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
index 68642b39da8..65f0199ae80 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
@@ -28,7 +28,6 @@ import org.apache.hudi.configuration.HadoopConfigurations;
 import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieValidationException;
 import org.apache.hudi.index.HoodieIndex;
-import org.apache.hudi.keygen.ComplexAvroKeyGenerator;
 import org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator;
 import org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator;
 import org.apache.hudi.util.AvroSchemaConverter;
@@ -318,11 +317,7 @@ public class HoodieTableFactory implements 
DynamicTableSourceFactory, DynamicTab
   }
 }
 boolean complexHoodieKey = pks.length > 1 || partitions.length > 1;
-if (complexHoodieKey && FlinkOptions.isDefaultValueDefined(conf, 
FlinkOptions.KEYGEN_CLASS_NAME)) {
-  conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, 
ComplexAvroKeyGenerator.class.getName());
-  LOG.info("Table option [{}] is reset to {} because record key or 
partition path has two or more fields",
-  FlinkOptions.KEYGEN_CLASS_NAME.key(), 
ComplexAvroKeyGenerator.class.getName());
-}
+StreamerUtil.checkKeygenGenerator(complexHoodieKey, conf);
   }
 
   /**
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
index d25db7d82fa..f9088b4096c 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
@@ -343,6 +343,10 @@ public class HoodieCatalog extends AbstractCatalog {
   final String partitions = String.join(",", 
resolvedTable.getPartitionKeys());
   conf.setString(FlinkOptions.PARTITION_PATH_FIELD, partitions);
   options.put(TableOptionProperties.PARTITION_COLUMNS, partitions);
+
+  final String[] pks = 
conf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
+  boolean complexHoodieKey = pks.length > 1 || 
resolvedTable.getPartitionKeys().size() > 1;
+  StreamerUtil.checkKeygenGenerator(complexHoodieKey, conf);
 } else {
   conf.setString(FlinkOptions.KEYGEN_CLASS_NAME.key(), 
NonpartitionedAvroKeyGenerator.class.getName());
 }
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
index 3e409d11f5d..ce0230e6939 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
@@ -502,6 +502,9 @@ public class HoodieHiveCatalog extends AbstractCatalog {
 if (catalogTable.isPartitioned() && 
!flinkConf.contains(FlinkOptions.PARTITION_PATH_FIELD)) {
   final String partitions = String.join(",", 
catalogTable.getPartitionKeys());
   flinkConf.setString(FlinkOptions.PARTITION_PATH_FIELD, partitions);
+  final String[] pks = 
flinkConf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
+  boolean complexHoodieKey = pks.length > 1 || 
catalogTable.getPartitionKeys().size() > 1;
+  StreamerUtil.checkKeygenGenerator(complexHoodieKey, flinkConf);
 }
 
 if (!catalogTable.isPartitioned()) {
diff --gi
