[GitHub] [hudi] codope commented on a diff in pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


codope commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223850963


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##
@@ -61,6 +61,12 @@ public class HoodieCompactionConfig extends HoodieConfig {
   + "but users are expected to trigger async job for execution. If 
`hoodie.compact.inline` is set to true, regular writers will do both scheduling 
and "
   + "execution inline for compaction");
 
+  public static final ConfigProperty ENABLE_LOG_COMPACTION = 
ConfigProperty
+  .key("hoodie.log.compaction.enable")
+  .defaultValue("false")
+  .sinceVersion("0.14")
+  .withDocumentation("By enabling log compaction through this config, log 
compaction will also gets enabled to metadata table.");

Review Comment:
   ```suggestion
 .withDocumentation("By enabling log compaction through this config, 
log compaction will also get enabled for the metadata table.");
   ```



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java:
##
@@ -47,6 +50,7 @@ public static  HoodieMergeHandle 
create(
   String fileId,
   TaskContextSupplier taskContextSupplier,
   Option keyGeneratorOpt) {
+LOG.info("Get updateHandle for fileId " + fileId + " and partitionPath " + 
partitionPath + " at commit " + instantTime);

Review Comment:
   Are these logs really necessary? If so, please consider logging in debug 
mode. Same for all logs.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1021,17 +1023,46 @@ private void 
runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
* deltacommit.
*/
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String 
latestDeltacommitTime) {
+
+// Check if there are any pending compaction or log compaction instants in 
the timeline.
+// If pending compact/logcompaction operations are found abort scheduling 
new compaction/logcompaction operations.
+Option pendingLogCompactionInstant =
+
metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();
+Option pendingCompactionInstant =
+
metadataMetaClient.getActiveTimeline().filterPendingCompactionTimeline().firstInstant();
+if (pendingLogCompactionInstant.isPresent() || 
pendingCompactionInstant.isPresent()) {

Review Comment:
   this validation can be moved inside 
"validateTimelineBeforeSchedulingCompaction"



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1021,17 +1023,46 @@ private void 
runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
* deltacommit.
*/
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String 
latestDeltacommitTime) {
+

Review Comment:
   nit: remove newline



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java:
##
@@ -109,4 +115,5 @@ public interface HoodieTableMetadataWriter extends 
Serializable, AutoCloseable {
* deciding if optimizations can be 
performed.
*/
   void performTableServices(Option inFlightInstantTimestamp);
+

Review Comment:
   nit: remove newline



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/plan/generators/BaseHoodieCompactionPlanGenerator.java:
##
@@ -82,6 +84,7 @@ public HoodieCompactionPlan generateCompactionPlan() throws 
IOException {
 
 // filter the partition paths if needed to reduce list status
 partitionPaths = filterPartitionPathsByStrategy(writeConfig, 
partitionPaths);
+LOG.info("Filtered partition paths are " + partitionPaths);

Review Comment:
   ```suggestion
   LOG.debug("Filtered partition paths are " + partitionPaths);
   ```



##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestDataValidationCheckForLogCompactionActions.java:
##
@@ -377,7 +377,7 @@ private TestTableContents setupTestTable2() throws 
IOException {
 // Create logcompaction client.
 HoodieWriteConfig logCompactionConfig = 
HoodieWriteConfig.newBuilder().withProps(config2.getProps())
 .withCompactionConfig(HoodieCompactionConfig.newBuilder()
-.withLogCompactionBlocksThreshold("2").build())
+.withLogCompactionBlocksThreshold(2).build())

Review Comment:
   good catch!



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -233,48 +233,49 @@ void testSyncMetadataTable() throws Exception {
 assertThat(completedTimeline.lastInstant().get().getTimestamp(), 
startsWith(HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP));
 
 // test metadata 

[GitHub] [hudi] danny0405 commented on a diff in pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223869176


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java:
##
@@ -47,6 +50,7 @@ public static  HoodieMergeHandle 
create(
   String fileId,
   TaskContextSupplier taskContextSupplier,
   Option keyGeneratorOpt) {
+LOG.info("Get updateHandle for fileId " + fileId + " and partitionPath " + 
partitionPath + " at commit " + instantTime);

Review Comment:
   `Create update handle for fileId ... and partition path ...`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] suryaprasanna commented on a diff in pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


suryaprasanna commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223865818


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -111,6 +112,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
 // deltacommits having corresponding completed commits. Therefore, 
we need to compact all fileslices of all
 // partitions together requiring UnBoundedCompactionStrategy.
 .withCompactionStrategy(new UnBoundedCompactionStrategy())
+// Check if log compaction is enabled, this is needed for tables 
with lot of records.
+.withLogCompactionEnabled(writeConfig.isLogCompactionEnabled())
+// This config is only used if enableLogCompactionForMetadata is 
set.

Review Comment:
   Fixed the comment.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1021,17 +1023,46 @@ private void 
runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
* deltacommit.
*/
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String 
latestDeltacommitTime) {
+
+// Check if there are any pending compaction or log compaction instants in 
the timeline.
+// If pending compact/logcompaction operations are found abort scheduling 
new compaction/logcompaction operations.
+Option pendingLogCompactionInstant =
+
metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();

Review Comment:
   Test for various cases like creation of compaction plan when logcompaction 
and vice versa are present in TestHoodieClientOnMergeOnReadStorage.



##
hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java:
##
@@ -57,6 +57,15 @@ public static int archive(JavaSparkContext jsc,
 .build();
 HoodieEngineContext context = new HoodieSparkEngineContext(jsc);
 HoodieSparkTable table = 
HoodieSparkTable.create(config, context);
+
+// Check if the metadata is already initialized. If it is initialize 
ignore the input arguments enableMetadata.

Review Comment:
   Reverting these changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


nsivabalan commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223854254


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnMergeOnReadStorage.java:
##
@@ -314,7 +314,7 @@ public void 
testSchedulingCompactionAfterSchedulingLogCompaction() throws Except
 
 // Try scheduling compaction, it wont succeed
 Option compactionTimeStamp = 
client.scheduleCompaction(Option.empty());
-assertFalse(compactionTimeStamp.isPresent());

Review Comment:
   do we know the reason why we had to flip. 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -111,6 +112,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
 // deltacommits having corresponding completed commits. Therefore, 
we need to compact all fileslices of all
 // partitions together requiring UnBoundedCompactionStrategy.
 .withCompactionStrategy(new UnBoundedCompactionStrategy())
+// Check if log compaction is enabled, this is needed for tables 
with lot of records.
+.withLogCompactionEnabled(writeConfig.isLogCompactionEnabled())
+// This config is only used if enableLogCompactionForMetadata is 
set.

Review Comment:
   not sure I get your comment here "This config is only used if 
enableLogCompactionForMetadata is set". from the code, it looks like we fetch 
from  writeConfig.isLogCompactionEnabled().



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:
##
@@ -233,48 +233,49 @@ void testSyncMetadataTable() throws Exception {
 assertThat(completedTimeline.lastInstant().get().getTimestamp(), 
startsWith(HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP));
 
 // test metadata table compaction
-// write another 4 commits
-for (int i = 1; i < 5; i++) {
+// write another 9 commits to trigger compaction twice. Since default 
clean version to retain is 2.

Review Comment:
   @danny0405 : can you review changes in flink classes.



##
hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java:
##
@@ -57,6 +57,15 @@ public static int archive(JavaSparkContext jsc,
 .build();
 HoodieEngineContext context = new HoodieSparkEngineContext(jsc);
 HoodieSparkTable table = 
HoodieSparkTable.create(config, context);
+
+// Check if the metadata is already initialized. If it is initialize 
ignore the input arguments enableMetadata.

Review Comment:
   are these required ? 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1021,17 +1023,46 @@ private void 
runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
* deltacommit.
*/
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String 
latestDeltacommitTime) {
+
+// Check if there are any pending compaction or log compaction instants in 
the timeline.
+// If pending compact/logcompaction operations are found abort scheduling 
new compaction/logcompaction operations.
+Option pendingLogCompactionInstant =
+
metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();

Review Comment:
   do we have tests for these? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8900:
URL: https://github.com/apache/hudi/pull/8900#issuecomment-1583999590

   
   ## CI report:
   
   * f9e3b8dd406a43d5808ee93105efb9154b05a6cb Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17691)
 
   * 85e65864e9376baf4d84149310810983751b87eb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8913:
URL: https://github.com/apache/hudi/pull/8913#issuecomment-1583999661

   
   ## CI report:
   
   * 3580939238ab2c8a458df5d4a14b0a6f07ccebed Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17694)
 
   * 2aa49c14bf4df38e11087d1add3518190093f7cc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583999528

   
   ## CI report:
   
   * 6bcf646df9a0223b8787e7bae2255c628aea54b4 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17693)
 
   * 8fda23303081b08c252c8d0eb74abe431af44901 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17697)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1223859381


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, 
String instantTime) {
*/
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String 
instantTime) {
-if (enabled && metadata != null) {
-  // Is this rollback of an instant that has been synced to the metadata 
table?
-  String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-  boolean wasSynced = 
metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-  if (!wasSynced) {
-// A compaction may have taken place on metadata table which would 
have included this instant being rolled back.
-// Revisit this logic to relax the compaction fencing : 
https://issues.apache.org/jira/browse/HUDI-2458
-Option latestCompaction = metadata.getLatestCompactionTime();
-if (latestCompaction.isPresent()) {
-  wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, 
HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-}
+// The commit which is being rolled back on the dataset
+final String commitInstantTime = 
rollbackMetadata.getCommitsRollback().get(0);
+// Find the deltacommits since the last compaction
+Option> deltaCommitsInfo =
+
CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+if (!deltaCommitsInfo.isPresent()) {
+  LOG.info(String.format("Ignoring rollback of instant %s at %s since 
there are no deltacommits on MDT", commitInstantTime, instantTime));
+  return;
+}
+
+// This could be a compaction or deltacommit instant (See 
CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+HoodieTimeline deltacommitsSinceCompaction = 
deltaCommitsInfo.get().getKey();
+
+// The deltacommit that will be rolled back
+HoodieInstant deltaCommitInstant = new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+// The commit being rolled back should not be older than the latest 
compaction on the MDT. Compaction on MDT only occurs when all actions
+// are completed on the dataset. Hence, this case implies a rollback of 
completed commit which should actually be handled using restore.
+if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+  final String compactionInstantTime = compactionInstant.getTimestamp();
+  if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, 
compactionInstantTime)) {
+throw new HoodieMetadataException(String.format("Commit being rolled 
back %s is older than the latest compaction %s. "
++ "There are %d deltacommits after this compaction: %s", 
commitInstantTime, compactionInstantTime,
+deltacommitsSinceCompaction.countInstants(), 
deltacommitsSinceCompaction.getInstants()));
   }
+}
 
-  Map> records =
-  HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, 
metadataMetaClient.getActiveTimeline(),
-  rollbackMetadata, getRecordsGenerationParams(), instantTime,
-  metadata.getSyncedInstantTime(), wasSynced);
-  commit(instantTime, records, false);
-  closeInternal();
+if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {
+  LOG.info("Rolling back MDT deltacommit " + commitInstantTime);
+  if (!getWriteClient().rollback(commitInstantTime, instantTime)) {
+throw new HoodieMetadataException("Failed to rollback deltacommit at " 
+ commitInstantTime);
+  }
+} else {
+  LOG.info(String.format("Ignoring rollback of instant %s at %s since 
there are no corresponding deltacommits on MDT",
+  commitInstantTime, instantTime));
 }
+
+// Rollback of MOR table may end up adding a new log file. So we need to 
check for added files and add them to MDT
+processAndCommit(instantTime, () -> 
HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, 
metadataMetaClient.getActiveTimeline(),
+rollbackMetadata, getRecordsGenerationParams(), instantTime,
+metadata.getSyncedInstantTime(), true), false);

Review Comment:
   Not sure why we perform the rollback again ~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1223857627


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, 
String instantTime) {
*/
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String 
instantTime) {
-if (enabled && metadata != null) {
-  // Is this rollback of an instant that has been synced to the metadata 
table?
-  String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-  boolean wasSynced = 
metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-  if (!wasSynced) {
-// A compaction may have taken place on metadata table which would 
have included this instant being rolled back.
-// Revisit this logic to relax the compaction fencing : 
https://issues.apache.org/jira/browse/HUDI-2458
-Option latestCompaction = metadata.getLatestCompactionTime();
-if (latestCompaction.isPresent()) {
-  wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, 
HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-}
+// The commit which is being rolled back on the dataset
+final String commitInstantTime = 
rollbackMetadata.getCommitsRollback().get(0);
+// Find the deltacommits since the last compaction
+Option> deltaCommitsInfo =
+
CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+if (!deltaCommitsInfo.isPresent()) {
+  LOG.info(String.format("Ignoring rollback of instant %s at %s since 
there are no deltacommits on MDT", commitInstantTime, instantTime));
+  return;
+}
+
+// This could be a compaction or deltacommit instant (See 
CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+HoodieTimeline deltacommitsSinceCompaction = 
deltaCommitsInfo.get().getKey();
+
+// The deltacommit that will be rolled back
+HoodieInstant deltaCommitInstant = new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+// The commit being rolled back should not be older than the latest 
compaction on the MDT. Compaction on MDT only occurs when all actions
+// are completed on the dataset. Hence, this case implies a rollback of 
completed commit which should actually be handled using restore.
+if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+  final String compactionInstantTime = compactionInstant.getTimestamp();
+  if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, 
compactionInstantTime)) {
+throw new HoodieMetadataException(String.format("Commit being rolled 
back %s is older than the latest compaction %s. "
++ "There are %d deltacommits after this compaction: %s", 
commitInstantTime, compactionInstantTime,
+deltacommitsSinceCompaction.countInstants(), 
deltacommitsSinceCompaction.getInstants()));
   }
+}
 
-  Map> records =
-  HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, 
metadataMetaClient.getActiveTimeline(),
-  rollbackMetadata, getRecordsGenerationParams(), instantTime,
-  metadata.getSyncedInstantTime(), wasSynced);
-  commit(instantTime, records, false);
-  closeInternal();
+if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {

Review Comment:
   Use `deltacommitsSinceCompaction` should be fine?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] voonhous commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-06-08 Thread via GitHub


voonhous commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1583996315

   @pftn can you please help to verify if the data in these 2 parquets are the 
same?
   
   1. 
20220604/0007-3477-401f-982e-e5ae38ca0e23_3-20-6_20230510170043301.parquet
   2. 
20220604/0007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet
   
   Do you still have the compaction plans that generated these 2 parquet files, 
it'll be extremely helpful if we can know the write token of the log files 
before compaction. Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8913:
URL: https://github.com/apache/hudi/pull/8913#issuecomment-1583993645

   
   ## CI report:
   
   * 3580939238ab2c8a458df5d4a14b0a6f07ccebed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17694)
 
   * 2aa49c14bf4df38e11087d1add3518190093f7cc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8911: [Hudi-8882] Compatible with hive 2.2.x to read hudi rt table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8911:
URL: https://github.com/apache/hudi/pull/8911#issuecomment-1583993589

   
   ## CI report:
   
   * 1ddc84cab970a6a43ea77a729213dc8c5200d845 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17690)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8900:
URL: https://github.com/apache/hudi/pull/8900#issuecomment-1583993513

   
   ## CI report:
   
   * fe74a9a7d32286ae29ded9370f6d53ccb14c8809 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17677)
 
   * f9e3b8dd406a43d5808ee93105efb9154b05a6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17691)
 
   * 85e65864e9376baf4d84149310810983751b87eb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583993438

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   * 6bcf646df9a0223b8787e7bae2255c628aea54b4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17693)
 
   * 8fda23303081b08c252c8d0eb74abe431af44901 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583990851

   @CTTY Thanks for the review.  I addressed all your comments.  @rahil-c 
@mansipp let me know if you have more comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223853653


##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark34HoodieParquetFileFormat.scala:
##
@@ -0,0 +1,532 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.mapred.FileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.hudi.HoodieSparkUtils
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetInputFormat, ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{Cast, JoinedRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.WholeStageCodegenExec
+import 
org.apache.spark.sql.execution.datasources.parquet.Spark34HoodieParquetFileFormat._
+import org.apache.spark.sql.execution.datasources.{DataSourceUtils, 
PartitionedFile, RecordReaderIterator}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types.{AtomicType, DataType, StructField, 
StructType}
+import org.apache.spark.util.SerializableConfiguration
+/**
+ * This class is an extension of [[ParquetFileFormat]] overriding 
Spark-specific behavior
+ * that's not possible to customize in any other way
+ *
+ * NOTE: This is a version of [[AvroDeserializer]] impl from Spark 3.2.1 w/ w/ 
the following changes applied to it:
+ * 
+ *   Avoiding appending partition values to the rows read from the data 
file
+ *   Schema on-read
+ * 
+ */
+class Spark34HoodieParquetFileFormat(private val shouldAppendPartitionValues: 
Boolean) extends ParquetFileFormat {
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+val conf = sparkSession.sessionState.conf
+conf.parquetVectorizedReaderEnabled && 
schema.forall(_.dataType.isInstanceOf[AtomicType])
+  }
+
+  def supportsColumnar(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+val conf = sparkSession.sessionState.conf
+// Only output columnar if there is WSCG to read it.
+val requiredWholeStageCodegenSettings =
+  conf.wholeStageEnabled && !WholeStageCodegenExec.isTooManyFields(conf, 
schema)
+requiredWholeStageCodegenSettings &&
+  supportBatch(sparkSession, schema)
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): 
PartitionedFile => Iterator[InternalRow] = {
+hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, 
classOf[ParquetReadSupport].getName)
+hadoopConf.set(
+  ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA,
+  requiredSchema.json)
+hadoopConf.set(
+  

[GitHub] [hudi] hudi-bot commented on pull request #8905: [HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8905:
URL: https://github.com/apache/hudi/pull/8905#issuecomment-1583987938

   
   ## CI report:
   
   * f8f14263190df7b66143e192188e68463e0c1efd UNKNOWN
   * f9adcecf4e54774510569f14af4c81a1f4951a28 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17681)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17689)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1223851326


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -837,10 +840,75 @@ public void update(HoodieCleanMetadata cleanMetadata, 
String instantTime) {
*/
   @Override
   public void update(HoodieRestoreMetadata restoreMetadata, String 
instantTime) {
-processAndCommit(instantTime, () -> 
HoodieTableMetadataUtil.convertMetadataToRecords(engineContext,
-metadataMetaClient.getActiveTimeline(), restoreMetadata, 
getRecordsGenerationParams(), instantTime,
-metadata.getSyncedInstantTime()), false);
-closeInternal();
+dataMetaClient.reloadActiveTimeline();
+
+// Since the restore has completed on the dataset, the latest write 
timeline instant is the one to which the
+// restore was performed. This should be always present.
+final String restoreToInstantTime = 
dataMetaClient.getActiveTimeline().getWriteTimeline()
+.getReverseOrderedInstants().findFirst().get().getTimestamp();

Review Comment:
   Why not use `lastInstant` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8904: [SUPPORT] spark-sql hudi table Caused by: org.apache.avro.AvroTypeException: Found string, expecting union

2023-06-08 Thread via GitHub


ad1happy2go commented on issue #8904:
URL: https://github.com/apache/hudi/issues/8904#issuecomment-1583978910

   @zyclove This is known issue with hudi 0.11.1. 
   
   This was fixed with this commit - https://github.com/apache/hudi/pull/6358
   
   Can you try out this and let us know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2023-06-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3891:
-
Epic Link: HUDI-1297

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, 
> image-2022-04-16-13-50-43-956.png
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, i've run 
> the test against the same (Hudi) table:
>  # In one query path i'm reading it as just a raw Parquet table
>  # In another, i'm reading it as Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8914: [HUDI-6344] Flink MDT bulk_insert for initial commit

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8914:
URL: https://github.com/apache/hudi/pull/8914#issuecomment-1583960558

   
   ## CI report:
   
   * c72b73a619fbc720e343b1fc5a0e3e9506857d1b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17695)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8913:
URL: https://github.com/apache/hudi/pull/8913#issuecomment-1583960533

   
   ## CI report:
   
   * 3580939238ab2c8a458df5d4a14b0a6f07ccebed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17694)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6342) Fix flaky MultiTableDeltaStreamer test

2023-06-08 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6342.

Fix Version/s: 0.14.0
   Resolution: Fixed

Fixed via master branch: f1c8049f81af94dc4b01b25eb80218a9d97f2a8e

> Fix flaky MultiTableDeltaStreamer test
> --
>
> Key: HUDI-6342
> URL: https://issues.apache.org/jira/browse/HUDI-6342
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> TestHoodieDeltaStreamerWithMultiWriter.
> testUpsertsContinuousModeWithMultipleWritersForConflicts 
> is flaky in recent times. 
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17675/logs/21]
>  
> {code:java}
> 2023-06-08T14:02:50.4346417Z 798455 [pool-1655-thread-1] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Continuous job failed java.lang.RuntimeException: Ingestion service was 
> shut down with exception.
> 2023-06-08T14:02:50.4351308Z 798455 [Listener at localhost/45789] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Conflict happened, but not expected 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7579883Z [ERROR] Tests run: 5, Failures: 0, Errors: 1, 
> Skipped: 1, Time elapsed: 201.181 s <<< FAILURE! - in 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
> 2023-06-08T14:02:50.7615120Z [ERROR] 
> testUpsertsContinuousModeWithMultipleWritersForConflicts{HoodieTableType}[2]  
> Time elapsed: 56.062 s  <<< ERROR!
> 2023-06-08T14:02:50.7615570Z java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7616039Z  at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 2023-06-08T14:02:50.7616662Z  at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 2023-06-08T14:02:50.7617179Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:398)
> 2023-06-08T14:02:50.7617674Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts(TestHoodieDeltaStreamerWithMultiWriter.java:140)
> 2023-06-08T14:02:50.7618059Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2023-06-08T14:02:50.7618319Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2023-06-08T14:02:50.7618615Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2023-06-08T14:02:50.7618896Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2023-06-08T14:02:50.7619173Z  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2023-06-08T14:02:50.7619480Z  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2023-06-08T14:02:50.7619845Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2023-06-08T14:02:50.7620217Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2023-06-08T14:02:50.7620540Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2023-06-08T14:02:50.7620903Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2023-06-08T14:02:50.7621288Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2023-06-08T14:02:50.7621849Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2023-06-08T14:02:50.767Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2023-06-08T14:02:50.7622626Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2023-06-08T14:02:50.7623010Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2023-06-08T14:02:50.7623375Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2023-06-08T14:02:50.7623723Z  at 
> 

[hudi] branch master updated (593181397e2 -> f1c8049f81a)

2023-06-08 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 593181397e2 [HUDI-5352] Fix `LocalDate` serialization in colstats 
(#8840)
 add f1c8049f81a [HUDI-6342] Fixing flaky Continuous mode multi writer 
tests (#8910)

No new revisions were added by this update.

Summary of changes:
 .../TestHoodieDeltaStreamerWithMultiWriter.java | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


danny0405 merged PR #8910:
URL: https://github.com/apache/hudi/pull/8910


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


codope commented on code in PR #8913:
URL: https://github.com/apache/hudi/pull/8913#discussion_r1223830445


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataTableWithSparkDataSource.scala:
##
@@ -84,23 +84,21 @@ class TestMetadataTableWithSparkDataSource extends 
SparkClientFunctionalTestHarn
   .mode(SaveMode.Append)
   .save(basePath)
 
-// Files partition of MT
-val filesPartitionDF = 
spark.read.format(hudi).load(s"$basePath/.hoodie/metadata/files")
+val mdtDf = spark.read.format("hudi").load(s"$basePath/.hoodie/metadata")
+mdtDf.show()

Review Comment:
   yeah let's remove this.. can be helpful for debugging but already our logs 
are bloated. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8126: [SUPPORT] Exit code 137 (interrupted by signal 9: SIGKILL) when StreamWriteFunction detect object size

2023-06-08 Thread via GitHub


danny0405 commented on issue #8126:
URL: https://github.com/apache/hudi/issues/8126#issuecomment-1583957629

   Yeah, it is introduced by https://github.com/apache/hudi/pull/6657, @codope 
, do you have any thoughts here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8913:
URL: https://github.com/apache/hudi/pull/8913#issuecomment-1583956648

   
   ## CI report:
   
   * 3580939238ab2c8a458df5d4a14b0a6f07ccebed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8914: [HUDI-6344] Flink MDT bulk_insert for initial commit

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8914:
URL: https://github.com/apache/hudi/pull/8914#issuecomment-1583956673

   
   ## CI report:
   
   * c72b73a619fbc720e343b1fc5a0e3e9506857d1b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


codope commented on code in PR #8910:
URL: https://github.com/apache/hudi/pull/8910#discussion_r1223827683


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerWithMultiWriter.java:
##
@@ -404,12 +405,24 @@ private void runJobsInParallel(String tableBasePath, 
HoodieTableType tableType,
* Need to perform getMessage().contains since the exception coming
* from {@link 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.DeltaSyncService} 
gets wrapped many times into RuntimeExceptions.
*/
-  if (expectConflict && 
e.getCause().getMessage().contains(ConcurrentModificationException.class.getName()))
 {
+  if (expectConflict && backfillFailed.get() && 
e.getCause().getMessage().contains(ConcurrentModificationException.class.getName()))
 {
 // expected ConcurrentModificationException since ingestion & backfill 
will have overlapping writes
-if (backfillFailed.get()) {
+if (!continuousFailed.get()) {
   // if backfill job failed, shutdown the continuous job.
   LOG.warn("Calling shutdown on ingestion job since the backfill job 
has failed for " + jobId);
   ingestionJob.shutdownGracefully();
+} else {
+  // both backfill and ingestion job cannot fail.
+  throw new HoodieException("Both backfilling and ingestion job failed 
", e);
+}
+  } else if (expectConflict && continuousFailed.get() && 
e.getCause().getMessage().contains("Ingestion service was shut down with 
exception")) {
+// incase of regular ingestion job failing, 
ConcurrentModificationException is not throw all the way.

Review Comment:
   nit: `thrown`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8913:
URL: https://github.com/apache/hudi/pull/8913#discussion_r1223826247


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataTableWithSparkDataSource.scala:
##
@@ -84,23 +84,21 @@ class TestMetadataTableWithSparkDataSource extends 
SparkClientFunctionalTestHarn
   .mode(SaveMode.Append)
   .save(basePath)
 
-// Files partition of MT
-val filesPartitionDF = 
spark.read.format(hudi).load(s"$basePath/.hoodie/metadata/files")
+val mdtDf = spark.read.format("hudi").load(s"$basePath/.hoodie/metadata")
+mdtDf.show()

Review Comment:
   what the purpose of the `show` calling?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8913:
URL: https://github.com/apache/hudi/pull/8913#discussion_r1223825494


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1453,11 +1453,7 @@ public static String 
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
* @return The fileID
*/
   public static String getFileIDForFileGroup(MetadataPartitionType 
partitionType, int index) {
-if (partitionType == MetadataPartitionType.FILES) {
-  return String.format("%s%04d-%d", partitionType.getFileIdPrefix(), 
index, 0);
-} else {
-  return String.format("%s%04d", partitionType.getFileIdPrefix(), index);
-}
+return String.format("%s%04d-%d", partitionType.getFileIdPrefix(), index, 
0);

Review Comment:
   Nice catch ~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6344) Support Flink MDT bulk_insert

2023-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6344:
-
Labels: pull-request-available  (was: )

> Support Flink MDT bulk_insert
> -
>
> Key: HUDI-6344
> URL: https://issues.apache.org/jira/browse/HUDI-6344
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 opened a new pull request, #8914: [HUDI-6344] Flink MDT bulk_insert for initial commit

2023-06-08 Thread via GitHub


danny0405 opened a new pull request, #8914:
URL: https://github.com/apache/hudi/pull/8914

   ### Change Logs
   
   Fix the bulk_insert for Flink MDT initialization after #8684 .
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8910:
URL: https://github.com/apache/hudi/pull/8910#issuecomment-1583949317

   
   ## CI report:
   
   * fc4825fd3b646e3b69322b386fa4b2fd4f19ba67 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6344) Support Flink MDT bulk_insert

2023-06-08 Thread Danny Chen (Jira)
Danny Chen created HUDI-6344:


 Summary: Support Flink MDT bulk_insert
 Key: HUDI-6344
 URL: https://issues.apache.org/jira/browse/HUDI-6344
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink-sql
Reporter: Danny Chen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6344) Support Flink MDT bulk_insert

2023-06-08 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6344:
-
Fix Version/s: 0.14.0

> Support Flink MDT bulk_insert
> -
>
> Key: HUDI-6344
> URL: https://issues.apache.org/jira/browse/HUDI-6344
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


nsivabalan commented on PR #8910:
URL: https://github.com/apache/hudi/pull/8910#issuecomment-1583934703

   We have an unrelated test failure:
   ```
   Test Call run_clustering Procedure Order Strategy *** FAILED ***
   ```
   Since this patch is also fixing a flaky test, I prefer to go ahead with landing it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6343) File id format differs from FILES partition and others when it's initialized for the first time

2023-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6343:
-
Labels: pull-request-available  (was: )

> File id format differs from FILES partition and others when it's initialized 
> for the first time
> --
>
> Key: HUDI-6343
> URL: https://issues.apache.org/jira/browse/HUDI-6343
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/blob/593181397e2f03b1172487e280ad279557bbf423/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L1455]
>  
> When bulk insert gets triggered, the file group Id might differ for other 
> partitions. We might need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan opened a new pull request, #8913: [HUDI-6343] Fixing fileId format for all mdt partitions

2023-06-08 Thread via GitHub


nsivabalan opened a new pull request, #8913:
URL: https://github.com/apache/hudi/pull/8913

   ### Change Logs
   
   Fixing the fileId format for all MDT partitions. When bulk_insert gets triggered, the fileId gets suffixed with "-0" at the end, so we need to ensure the initial instantiation also follows the same format. 
   
   ### Impact
   
   Fixing fileId format for all mdt partitions
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6343) File id format differs from FILES partition and others when it's initialized for the first time

2023-06-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6343:
-

 Summary: File id format differs from FILES partition and others when it's initialized for the first time
 Key: HUDI-6343
 URL: https://issues.apache.org/jira/browse/HUDI-6343
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


[https://github.com/apache/hudi/blob/593181397e2f03b1172487e280ad279557bbf423/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L1455]

 

When bulk insert gets triggered, the file group Id might differ for other 
partitions. We might need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583913902

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   * 6bcf646df9a0223b8787e7bae2255c628aea54b4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17693)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583907653

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   * 6bcf646df9a0223b8787e7bae2255c628aea54b4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223791778


##
hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java:
##
@@ -35,6 +36,8 @@ public class JsonUtils {
   private static final ObjectMapper MAPPER = new ObjectMapper();
 
   static {
+registerModules(MAPPER);
+

Review Comment:
   This is fixed now.
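   For readers of this fix, a hedged sketch of the usual Jackson pattern for `java.time.LocalDate`: register the JSR-310 module on the mapper (whether `registerModules` wires in exactly this module is an assumption here, not something this diff confirms):

   ```java
   import com.fasterxml.jackson.databind.ObjectMapper;
   import com.fasterxml.jackson.databind.SerializationFeature;
   import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
   import java.time.LocalDate;

   class LocalDateJsonSketch {
     private static final ObjectMapper MAPPER = new ObjectMapper();

     static {
       // Assumption: registering the JSR-310 module is what makes LocalDate serializable.
       MAPPER.registerModule(new JavaTimeModule());
       // Optional: emit ISO-8601 strings ("2023-06-08") instead of [2023,6,8].
       MAPPER.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
     }

     public static void main(String[] args) throws Exception {
       System.out.println(MAPPER.writeValueAsString(LocalDate.of(2023, 6, 8))); // "2023-06-08"
     }
   }
   ```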



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zyclove opened a new issue, #8912: [SUPPORT]hudi 0.12.2 sometimes appear org.apache.hudi.exception.HoodieIOException: IOException when reading log file

2023-06-08 Thread via GitHub


zyclove opened a new issue, #8912:
URL: https://github.com/apache/hudi/issues/8912

   
   **Describe the problem you faced**
   Scheduled Hudi Spark SQL tasks sometimes fail with org.apache.hudi.exception.HoodieIOException: IOException when reading log file.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. A Hudi Spark SQL task runs once every half hour.
   2. It sometimes fails because a log file is missing, as shown below.
   3. Simply touching the missing file and rerunning works, but that is not a proper fix.
   
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :0.12.2
   
   * Spark version : 3.2.1 (AWS EMR)
   
   * Hive version :2.3.9
   
   * Hadoop version :3.2.1
   
   * Storage (HDFS/S3/GCS..) :s3
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```
   23/06/09 02:39:16 INFO BlockManagerInfo: Added rdd_7_16 in memory on ip-172-16-13-109.us-west-2.compute.internal:42539 (size: 2.2 KiB, free: 8.4 GiB)
   23/06/09 02:39:16 WARN TaskSetManager: Lost task 93.3 in stage 0.0 (TID 110) (ip-172-16-12-181.us-west-2.compute.internal executor 1): org.apache.hudi.exception.HoodieIOException: IOException when reading log file
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:374)
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209)
   at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:113)
   at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:106)
   at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343)
   at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305)
   at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:89)
   at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.sql.execution.SQLConfInjectingRDD.compute(SQLConfInjectingRDD.scala:58)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
   at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
   at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
   at 
   ```

[hudi] branch master updated (80e0b557ffe -> 593181397e2)

2023-06-08 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 80e0b557ffe [HUDI-6310] 
CreateHoodieTableCommand::createHiveDataSourceTable arguments refactor (#8874)
 add 593181397e2 [HUDI-5352] Fix `LocalDate` serialization in colstats 
(#8840)

No new revisions were added by this update.

Summary of changes:
 hudi-common/pom.xml  |  4 
 .../java/org/apache/hudi/common/util/JsonUtils.java  |  8 
 .../hudi/functional/TestColumnStatsIndex.scala   |  7 ++-
 packaging/hudi-flink-bundle/pom.xml  |  3 ++-
 packaging/hudi-hadoop-mr-bundle/pom.xml  | 13 +
 packaging/hudi-hive-sync-bundle/pom.xml  |  3 +++
 packaging/hudi-integ-test-bundle/pom.xml |  5 +++--
 packaging/hudi-kafka-connect-bundle/pom.xml  | 13 +
 packaging/hudi-spark-bundle/pom.xml  |  3 +++
 packaging/hudi-timeline-server-bundle/pom.xml|  9 +
 packaging/hudi-utilities-bundle/pom.xml  |  3 +++
 pom.xml  | 20 
 12 files changed, 83 insertions(+), 8 deletions(-)



[GitHub] [hudi] yihua merged pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


yihua merged PR #8840:
URL: https://github.com/apache/hudi/pull/8840


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8900:
URL: https://github.com/apache/hudi/pull/8900#issuecomment-1583854119

   
   ## CI report:
   
   * fe74a9a7d32286ae29ded9370f6d53ccb14c8809 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17677)
 
   * f9e3b8dd406a43d5808ee93105efb9154b05a6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zyclove commented on issue #8903: [SUPPORT] aws spark3.2.1 & hudi 0.13.1 with java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile

2023-06-08 Thread via GitHub


zyclove commented on issue #8903:
URL: https://github.com/apache/hudi/issues/8903#issuecomment-1583849913

   @umehrot2 
   Hi Hudi experts, can anyone help me? This production issue is urgent; looking forward to an early reply.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zyclove commented on issue #8904: [SUPPORT] spark-sql hudi table Caused by: org.apache.avro.AvroTypeException: Found string, expecting union

2023-06-08 Thread via GitHub


zyclove commented on issue #8904:
URL: https://github.com/apache/hudi/issues/8904#issuecomment-1583848717

   Hi Hudi experts, can anyone help me? This production issue is urgent; looking forward to an early reply.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8911: [Hudi-8882] Compatible with hive 2.2.x to read hudi rt table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8911:
URL: https://github.com/apache/hudi/pull/8911#issuecomment-1583844595

   
   ## CI report:
   
   * 1ddc84cab970a6a43ea77a729213dc8c5200d845 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17690)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8900:
URL: https://github.com/apache/hudi/pull/8900#issuecomment-1583844504

   
   ## CI report:
   
   * fe74a9a7d32286ae29ded9370f6d53ccb14c8809 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17677)
 
   * f9e3b8dd406a43d5808ee93105efb9154b05a6cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8840:
URL: https://github.com/apache/hudi/pull/8840#issuecomment-1583844187

   
   ## CI report:
   
   * 80b25e613cbcdf8f3e1efe39436cad173163d9d9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on a diff in pull request #8452: [HUDI-6077] Add more partition push down filters

2023-06-08 Thread via GitHub


boneanxs commented on code in PR #8452:
URL: https://github.com/apache/hudi/pull/8452#discussion_r1223761168


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -50,20 +56,25 @@
 /**
  * Implementation of {@link HoodieTableMetadata} based file-system-backed table metadata.
  */
-public class FileSystemBackedTableMetadata implements HoodieTableMetadata {
+public class FileSystemBackedTableMetadata extends AbstractHoodieTableMetadata {
 
   private static final int DEFAULT_LISTING_PARALLELISM = 1500;
 
-  private final transient HoodieEngineContext engineContext;
-  private final SerializableConfiguration hadoopConf;
-  private final String datasetBasePath;
   private final boolean assumeDatePartitioning;
 
+  private final boolean hiveStylePartitioningEnabled;
+  private final boolean urlEncodePartitioningEnabled;
+
   public FileSystemBackedTableMetadata(HoodieEngineContext engineContext, SerializableConfiguration conf, String datasetBasePath,
                                        boolean assumeDatePartitioning) {
-    this.engineContext = engineContext;
-    this.hadoopConf = conf;
-    this.datasetBasePath = datasetBasePath;
+    super(engineContext, conf, datasetBasePath);
+
+    FileSystem fs = FSUtils.getFs(dataBasePath.get(), conf.get());
+    Path metaPath = new Path(dataBasePath.get(), HoodieTableMetaClient.METAFOLDER_NAME);
+    TableNotFoundException.checkTableValidity(fs, this.dataBasePath.get(), metaPath);
+    HoodieTableConfig tableConfig = new HoodieTableConfig(fs, metaPath.toString(), null, null);

Review Comment:
   Move creating `HoodieTableConfig` into `FileSystemBackedTableMetadata` only, in case `HoodieBackedTableMetadata` creates it twice. A sketch of the idea follows.
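   A minimal sketch of the suggested restructuring, under the assumption that the parent class holds only common state (all class names below are hypothetical stand-ins, not the real Hudi types):

   ```java
   // Sketch: hoist HoodieTableConfig creation out of the shared parent so only the
   // file-system-backed implementation loads it from storage.
   abstract class AbstractTableMetadataSketch {
     protected final String basePath;

     protected AbstractTableMetadataSketch(String basePath) {
       this.basePath = basePath; // common state only; no table config here
     }
   }

   final class TableConfigSketch {
     // Hypothetical loader standing in for new HoodieTableConfig(fs, metaPath, ...).
     static TableConfigSketch load(String basePath) {
       return new TableConfigSketch();
     }

     boolean hiveStylePartitioning() {
       return true;
     }
   }

   final class FileSystemBackedSketch extends AbstractTableMetadataSketch {
     private final boolean hiveStylePartitioningEnabled;

     FileSystemBackedSketch(String basePath) {
       super(basePath);
       // Loaded here, in the only subclass that must read it from storage; a
       // HoodieBackedTableMetadata-style subclass would reuse the config it already has.
       TableConfigSketch tableConfig = TableConfigSketch.load(basePath);
       this.hiveStylePartitioningEnabled = tableConfig.hiveStylePartitioning();
     }
   }
   ```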



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8911: [Hudi-8882] Compatible with hive 2.2.x to read hudi rt table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8911:
URL: https://github.com/apache/hudi/pull/8911#issuecomment-1583838150

   
   ## CI report:
   
   * 1ddc84cab970a6a43ea77a729213dc8c5200d845 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8905: [HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8905:
URL: https://github.com/apache/hudi/pull/8905#issuecomment-1583838091

   
   ## CI report:
   
   * f8f14263190df7b66143e192188e68463e0c1efd UNKNOWN
   * f9adcecf4e54774510569f14af4c81a1f4951a28 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17681)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17689)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] thomasg19930417 commented on issue #8882: [SUPPORT] Using hive to read rt table exception

2023-06-08 Thread via GitHub


thomasg19930417 commented on issue #8882:
URL: https://github.com/apache/hudi/issues/8882#issuecomment-1583807262

   I took a rough look at the Hive 2.0 and 2.1 branch code, and it should be the same as 2.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on pull request #8905: [HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits

2023-06-08 Thread via GitHub


stream2000 commented on PR #8905:
URL: https://github.com/apache/hudi/pull/8905#issuecomment-1583798165

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] thomasg19930417 commented on pull request #8911: [Hudi-8882] Compatible with hive 2.2.x to read hudi rt table

2023-06-08 Thread via GitHub


thomasg19930417 commented on PR #8911:
URL: https://github.com/apache/hudi/pull/8911#issuecomment-1583794413

   @danny0405 Please help review the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8910:
URL: https://github.com/apache/hudi/pull/8910#issuecomment-1583793617

   
   ## CI report:
   
   * fc4825fd3b646e3b69322b386fa4b2fd4f19ba67 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] thomasg19930417 commented on issue #8882: [SUPPORT] Using hive to read rt table exception

2023-06-08 Thread via GitHub


thomasg19930417 commented on issue #8882:
URL: https://github.com/apache/hudi/issues/8882#issuecomment-1583792529

   @danny0405 I submitted a PR to be compatible with Hive 2.2: it copies part of the Hive 2.3 code into Hudi and converts Hive 2.2 data structures into the 2.3 form for processing. I'm not sure whether this approach is reasonable, and there may be some problems in the code; please help review #8911.
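   The general shape of such a compatibility shim, as a hedged sketch (the type names below are hypothetical; the actual PR copies concrete Hive 2.3 classes rather than defining an interface like this):

   ```java
   // Sketch of the shim idea: wrap a Hive 2.2-era structure behind the 2.3-style
   // API that the copied reader code expects.
   interface Hive23StyleRow {
     Object getField(int ordinal);
   }

   final class Hive22RowAdapter implements Hive23StyleRow {
     private final java.util.List<Object> hive22Fields; // stand-in for a Hive 2.2 structure

     Hive22RowAdapter(java.util.List<Object> hive22Fields) {
       this.hive22Fields = hive22Fields;
     }

     @Override
     public Object getField(int ordinal) {
       // Convert on access; a real shim would also translate ObjectInspector/type info.
       return hive22Fields.get(ordinal);
     }
   }
   ```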


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 merged pull request #8874: [HUDI-6310] CreateHoodieTableCommand::createHiveDataSourceTable arguments refactor

2023-06-08 Thread via GitHub


danny0405 merged PR #8874:
URL: https://github.com/apache/hudi/pull/8874


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (7ae8da02d12 -> 80e0b557ffe)

2023-06-08 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 7ae8da02d12 [HUDI-6200] Enhancements to the MDT for improving 
performance of larger indexes. (#8684)
 add 80e0b557ffe [HUDI-6310] 
CreateHoodieTableCommand::createHiveDataSourceTable arguments refactor (#8874)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)



[GitHub] [hudi] danny0405 commented on issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea"

2023-06-08 Thread via GitHub


danny0405 commented on issue #8906:
URL: https://github.com/apache/hudi/issues/8906#issuecomment-1583786336

   You are right; bucket index for bulk_insert is supported only in the latest master.
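   For readers hitting the same error, a minimal sketch of the combination being discussed (the config keys are standard Hudi write options; treat exact support for this pairing as version-dependent):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   class BucketBulkInsertOptionsSketch {
     // The failing combination: BUCKET index together with the bulk_insert operation.
     static Map<String, String> options() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.index.type", "BUCKET");                      // bucket index
       opts.put("hoodie.datasource.write.operation", "bulk_insert"); // bulk insert write op
       return opts;
     }
   }
   ```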


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] thomasg19930417 opened a new pull request, #8911: Compatible with hive 2.2.x to read hudi rt table

2023-06-08 Thread via GitHub


thomasg19930417 opened a new pull request, #8911:
URL: https://github.com/apache/hudi/pull/8911

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8910:
URL: https://github.com/apache/hudi/pull/8910#issuecomment-1583784756

   
   ## CI report:
   
   * fc4825fd3b646e3b69322b386fa4b2fd4f19ba67 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-06-08 Thread via GitHub


danny0405 commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1223719557


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1052,51 +1091,81 @@ protected HoodieData prepRecords(Map
+   * Don't perform optimization if there are inflight operations on the dataset. This is for two reasons:
+   * - The compaction will contain the correct data as all failed operations have been rolled back.
+   * - Clean/compaction etc. will have the highest timestamp on the MDT and we won't be adding new operations
+   * with smaller timestamps to metadata table (makes for easier debugging)
+   * 
+   * This adds the limitations that long-running async operations (clustering, etc.) may cause delay in such MDT
+   * optimizations. We will relax this after MDT code has been hardened.
    */
-  protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String instantTime) {
-    // finish off any pending compactions if any from previous attempt.
-    writeClient.runAnyPendingCompactions();
-
-    String latestDeltaCommitTimeInMetadataTable = metadataMetaClient.reloadActiveTimeline()
-        .getDeltaCommitTimeline()
-        .filterCompletedInstants()
-        .lastInstant().orElseThrow(() -> new HoodieMetadataException("No completed deltacommit in metadata table"))
-        .getTimestamp();
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, latest compaction instant time in MDT represents
-    // any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    // instant before c4 is synced with metadata table.
-    List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
+  @Override
+  public void performTableServices(Option inFlightInstantTimestamp) {
+    HoodieTimer metadataTableServicesTimer = HoodieTimer.start();
+    boolean allTableServicesExecutedSuccessfullyOrSkipped = true;
+    try {
+      BaseHoodieWriteClient writeClient = getWriteClient();
+      // Run any pending table services operations.
+      runPendingTableServicesOperations(writeClient);
+
+      // Check and run clean operations.
+      String latestDeltacommitTime = metadataMetaClient.reloadActiveTimeline().getDeltaCommitTimeline()
+          .filterCompletedInstants()
+          .lastInstant().get()
+          .getTimestamp();
+      LOG.info("Latest deltacommit time found is " + latestDeltacommitTime + ", running clean operations.");
+      cleanIfNecessary(writeClient, latestDeltacommitTime);
+
+      // Do timeline validation before scheduling compaction/logcompaction operations.
+      if (!validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) {
+        return;
Review Comment:
   > unless compaction in MDT kicks in, archival might not have anything to do 
after last time it was able to archive something.
   
   Then archiving will always be blocked by the compaction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


nsivabalan commented on code in PR #8910:
URL: https://github.com/apache/hudi/pull/8910#discussion_r1223690819


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerWithMultiWriter.java:
##
@@ -404,12 +405,24 @@ private void runJobsInParallel(String tableBasePath, HoodieTableType tableType,
 * Need to perform getMessage().contains since the exception coming
 * from {@link org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.DeltaSyncService} gets wrapped many times into RuntimeExceptions.
 */
-      if (expectConflict && e.getCause().getMessage().contains(ConcurrentModificationException.class.getName())) {
+      if (expectConflict && backfillFailed.get() && e.getCause().getMessage().contains(ConcurrentModificationException.class.getName())) {
         // expected ConcurrentModificationException since ingestion & backfill will have overlapping writes
-        if (backfillFailed.get()) {
+        if (!continuousFailed.get()) {

Review Comment:
   NTR: in most cases the backfill job fails and hence the test succeeds, but if the continuous job fails, the test times out. 
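   As a hedged illustration of the check the diff converges on (only `expectConflict`, `backfillFailed`, and `continuousFailed` come from the diff; the helper shape is hypothetical):

   ```java
   import java.util.ConcurrentModificationException;
   import java.util.concurrent.atomic.AtomicBoolean;

   class ConflictCheckSketch {
     // A conflict is only the "expected" outcome when the backfill job failed;
     // a continuous-job failure is a different, unexpected failure mode.
     static void assertExpectedFailure(Exception e, boolean expectConflict,
                                       AtomicBoolean backfillFailed, AtomicBoolean continuousFailed) {
       boolean isConflict = e.getCause() != null
           && e.getCause().getMessage().contains(ConcurrentModificationException.class.getName());
       if (expectConflict && backfillFailed.get() && isConflict && !continuousFailed.get()) {
         return; // expected: ingestion and backfill had overlapping writes
       }
       throw new AssertionError("Conflict happened, but not expected", e);
     }
   }
   ```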



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6342) Fix flaky MultiTableDeltaStreamer test

2023-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6342:
-
Labels: pull-request-available  (was: )

> Fix flaky MultiTableDeltaStreamer test
> --
>
> Key: HUDI-6342
> URL: https://issues.apache.org/jira/browse/HUDI-6342
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> TestHoodieDeltaStreamerWithMultiWriter.
> testUpsertsContinuousModeWithMultipleWritersForConflicts 
> is flaky in recent times. 
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17675/logs/21]
>  
> {code:java}
> 2023-06-08T14:02:50.4346417Z 798455 [pool-1655-thread-1] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Continuous job failed java.lang.RuntimeException: Ingestion service was 
> shut down with exception.
> 2023-06-08T14:02:50.4351308Z 798455 [Listener at localhost/45789] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Conflict happened, but not expected 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7579883Z [ERROR] Tests run: 5, Failures: 0, Errors: 1, 
> Skipped: 1, Time elapsed: 201.181 s <<< FAILURE! - in 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
> 2023-06-08T14:02:50.7615120Z [ERROR] 
> testUpsertsContinuousModeWithMultipleWritersForConflicts{HoodieTableType}[2]  
> Time elapsed: 56.062 s  <<< ERROR!
> 2023-06-08T14:02:50.7615570Z java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7616039Z  at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 2023-06-08T14:02:50.7616662Z  at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 2023-06-08T14:02:50.7617179Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:398)
> 2023-06-08T14:02:50.7617674Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts(TestHoodieDeltaStreamerWithMultiWriter.java:140)
> 2023-06-08T14:02:50.7618059Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2023-06-08T14:02:50.7618319Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2023-06-08T14:02:50.7618615Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2023-06-08T14:02:50.7618896Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2023-06-08T14:02:50.7619173Z  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2023-06-08T14:02:50.7619480Z  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2023-06-08T14:02:50.7619845Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2023-06-08T14:02:50.7620217Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2023-06-08T14:02:50.7620540Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2023-06-08T14:02:50.7620903Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2023-06-08T14:02:50.7621288Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2023-06-08T14:02:50.7621849Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2023-06-08T14:02:50.767Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2023-06-08T14:02:50.7622626Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2023-06-08T14:02:50.7623010Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2023-06-08T14:02:50.7623375Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2023-06-08T14:02:50.7623723Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 

[GitHub] [hudi] nsivabalan opened a new pull request, #8910: [HUDI-6342] Fixing flaky Continuous mode multi writer tests

2023-06-08 Thread via GitHub


nsivabalan opened a new pull request, #8910:
URL: https://github.com/apache/hudi/pull/8910

   ### Change Logs
   
   Fixing flaky continuous mode multi writer tests. The exception thrown when the continuous mode job fails is different from the one thrown when the backfill job fails, so the tests had to be fixed to account for that. 
   
   ### Impact
   
   Fixing flaky continuous mode multi writer tests. The exception thrown when the continuous mode job fails is different from the one thrown when the backfill job fails, so the tests had to be fixed to account for that. 
   
   ### Risk level (write none, low medium or high below)
   
   Stabilizes CI
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6342) Fix flaky MultiTableDeltaStreamer test

2023-06-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6342:
--
Epic Link: HUDI-4302

> Fix flaky MultiTableDeltaStreamer test
> --
>
> Key: HUDI-6342
> URL: https://issues.apache.org/jira/browse/HUDI-6342
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>
> TestHoodieDeltaStreamerWithMultiWriter.
> testUpsertsContinuousModeWithMultipleWritersForConflicts 
> is flaky in recent times. 
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17675/logs/21]
>  
> {code:java}
> 2023-06-08T14:02:50.4346417Z 798455 [pool-1655-thread-1] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Continuous job failed java.lang.RuntimeException: Ingestion service was 
> shut down with exception.
> 2023-06-08T14:02:50.4351308Z 798455 [Listener at localhost/45789] ERROR 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
>  [] - Conflict happened, but not expected 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7579883Z [ERROR] Tests run: 5, Failures: 0, Errors: 1, 
> Skipped: 1, Time elapsed: 201.181 s <<< FAILURE! - in 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
> 2023-06-08T14:02:50.7615120Z [ERROR] 
> testUpsertsContinuousModeWithMultipleWritersForConflicts{HoodieTableType}[2]  
> Time elapsed: 56.062 s  <<< ERROR!
> 2023-06-08T14:02:50.7615570Z java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Ingestion service was shut down with exception.
> 2023-06-08T14:02:50.7616039Z  at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 2023-06-08T14:02:50.7616662Z  at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 2023-06-08T14:02:50.7617179Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:398)
> 2023-06-08T14:02:50.7617674Z  at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts(TestHoodieDeltaStreamerWithMultiWriter.java:140)
> 2023-06-08T14:02:50.7618059Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2023-06-08T14:02:50.7618319Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2023-06-08T14:02:50.7618615Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2023-06-08T14:02:50.7618896Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2023-06-08T14:02:50.7619173Z  at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2023-06-08T14:02:50.7619480Z  at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2023-06-08T14:02:50.7619845Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2023-06-08T14:02:50.7620217Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2023-06-08T14:02:50.7620540Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2023-06-08T14:02:50.7620903Z  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2023-06-08T14:02:50.7621288Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2023-06-08T14:02:50.7621849Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2023-06-08T14:02:50.767Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2023-06-08T14:02:50.7622626Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2023-06-08T14:02:50.7623010Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2023-06-08T14:02:50.7623375Z  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2023-06-08T14:02:50.7623723Z  at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2023-06-08T14:02:50.7624054Z  at 
> 

[jira] [Created] (HUDI-6342) Fix flaky MultiTableDeltaStreamer test

2023-06-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6342:
-

 Summary: Fix flaky MultiTableDeltaStreamer test
 Key: HUDI-6342
 URL: https://issues.apache.org/jira/browse/HUDI-6342
 Project: Apache Hudi
  Issue Type: Bug
  Components: tests-ci
Reporter: sivabalan narayanan


TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts is flaky in recent times. 

 

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17675/logs/21]

 
{code:java}
2023-06-08T14:02:50.4346417Z 798455 [pool-1655-thread-1] ERROR 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter 
[] - Continuous job failed java.lang.RuntimeException: Ingestion service was 
shut down with exception.
2023-06-08T14:02:50.4351308Z 798455 [Listener at localhost/45789] ERROR 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter 
[] - Conflict happened, but not expected 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Ingestion 
service was shut down with exception.
2023-06-08T14:02:50.7579883Z [ERROR] Tests run: 5, Failures: 0, Errors: 1, 
Skipped: 1, Time elapsed: 201.181 s <<< FAILURE! - in 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter
2023-06-08T14:02:50.7615120Z [ERROR] 
testUpsertsContinuousModeWithMultipleWritersForConflicts{HoodieTableType}[2]  
Time elapsed: 56.062 s  <<< ERROR!
2023-06-08T14:02:50.7615570Z java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: Ingestion service was shut down with exception.
2023-06-08T14:02:50.7616039Zat 
java.util.concurrent.FutureTask.report(FutureTask.java:122)
2023-06-08T14:02:50.7616662Zat 
java.util.concurrent.FutureTask.get(FutureTask.java:192)
2023-06-08T14:02:50.7617179Zat 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:398)
2023-06-08T14:02:50.7617674Zat 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts(TestHoodieDeltaStreamerWithMultiWriter.java:140)
2023-06-08T14:02:50.7618059Zat 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2023-06-08T14:02:50.7618319Zat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2023-06-08T14:02:50.7618615Zat 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2023-06-08T14:02:50.7618896Zat 
java.lang.reflect.Method.invoke(Method.java:498)
2023-06-08T14:02:50.7619173Zat 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
2023-06-08T14:02:50.7619480Zat 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
2023-06-08T14:02:50.7619845Zat 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
2023-06-08T14:02:50.7620217Zat 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
2023-06-08T14:02:50.7620540Zat 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
2023-06-08T14:02:50.7620903Zat 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
2023-06-08T14:02:50.7621288Zat 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
2023-06-08T14:02:50.7621849Zat 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
2023-06-08T14:02:50.767Zat 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
2023-06-08T14:02:50.7622626Zat 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
2023-06-08T14:02:50.7623010Zat 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
2023-06-08T14:02:50.7623375Zat 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
2023-06-08T14:02:50.7623723Zat 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
2023-06-08T14:02:50.7624054Zat 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
2023-06-08T14:02:50.7624409Zat 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
2023-06-08T14:02:50.7624794Zat 

[jira] [Updated] (HUDI-6315) Optimize UPSERT codepath to use meta fields instead of key generation and index lookup

2023-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6315:
-
Labels: pull-request-available  (was: )

> Optimize UPSERT codepath to use meta fields instead of key generation and 
> index lookup
> --
>
> Key: HUDI-6315
> URL: https://issues.apache.org/jira/browse/HUDI-6315
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Amrish Lal
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8879: [HUDI-6315] [WIP] Optimize UPSERT codepath to use meta fields instead of key generation and index lookup

2023-06-08 Thread via GitHub


nsivabalan commented on code in PR #8879:
URL: https://github.com/apache/hudi/pull/8879#discussion_r175056


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java:
##
@@ -233,15 +238,25 @@ public static HoodieWriteResult 
doDeletePartitionsOperation(SparkRDDWriteClient
   }
 
   public static HoodieRecord createHoodieRecord(GenericRecord gr, Comparable 
orderingVal, HoodieKey hKey,
-  String payloadClass) throws IOException {
+  String payloadClass, HoodieRecordLocation recordLocation) throws 
IOException {
 HoodieRecordPayload payload = DataSourceUtils.createPayload(payloadClass, 
gr, orderingVal);
-return new HoodieAvroRecord<>(hKey, payload);
+
+HoodieAvroRecord record = new HoodieAvroRecord<>(hKey, payload);
+if (recordLocation != null) {
+  record.setCurrentLocation(recordLocation);
+}
+return record;
   }
 
+  // AKL_TODO: check if this change is needed. Also validate change if needed.
   public static HoodieRecord createHoodieRecord(GenericRecord gr, HoodieKey 
hKey,
-String payloadClass) throws 
IOException {
+String payloadClass, 
HoodieRecordLocation recordLocation) throws IOException {

Review Comment:
   same here



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -144,20 +144,25 @@ class DefaultSource extends RelationProvider
   mode: SaveMode,
   optParams: Map[String, String],
   df: DataFrame): BaseRelation = {
-val dfWithoutMetaCols = 
df.drop(HoodieRecord.HOODIE_META_COLUMNS.asScala:_*)
+val dfPrepped = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY, 
"false")

Review Comment:
   dfPrepped -> processedDf



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1160,21 +1171,29 @@ object HoodieSparkSqlWriter {
 
   // handle dropping partition columns
   it.map { avroRec =>
-val processedRecord = if (shouldDropPartitionColumns) {
-  HoodieAvroUtils.rewriteRecord(avroRec, dataFileSchema)
+val (hoodieKey: HoodieKey, recordLocation: 
Option[HoodieRecordLocation]) = getKeyAndLocatorFromAvroRecord(keyGenerator, 
avroRec,
+  isPrepped)
+
+val avroRecWithoutMeta: GenericRecord = if (isPrepped) {
+  HoodieAvroUtils.rewriteRecord(avroRec, 
HoodieAvroUtils.removeMetadataFields(dataFileSchema))
 } else {
   avroRec
 }
 
-val hoodieKey = new HoodieKey(keyGenerator.getRecordKey(avroRec), 
keyGenerator.getPartitionPath(avroRec))
+val processedRecord = if (shouldDropPartitionColumns) {
+  HoodieAvroUtils.rewriteRecord(avroRecWithoutMeta, dataFileSchema)
+} else {
+  avroRecWithoutMeta
+}
+
 val hoodieRecord = if (shouldCombine) {
   val orderingVal = HoodieAvroUtils.getNestedFieldVal(avroRec, 
config.getString(PRECOMBINE_FIELD),
 false, 
consistentLogicalTimestampEnabled).asInstanceOf[Comparable[_]]
   DataSourceUtils.createHoodieRecord(processedRecord, orderingVal, 
hoodieKey,
-config.getString(PAYLOAD_CLASS_NAME))
+config.getString(PAYLOAD_CLASS_NAME), 
recordLocation.getOrElse(null))
 } else {
-  DataSourceUtils.createHoodieRecord(processedRecord, hoodieKey,
-config.getString(PAYLOAD_CLASS_NAME))
+  // AKL_TODO: check if this change is needed.

Review Comment:
   fix the comments



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1195,18 +1214,108 @@ object HoodieSparkSqlWriter {
   }
   val sparkKeyGenerator = 
HoodieSparkKeyGeneratorFactory.createKeyGenerator(keyGenProps).asInstanceOf[SparkKeyGeneratorInterface]
   val targetStructType = if (shouldDropPartitionColumns) 
dataFileStructType else writerStructType
+  val finalStructType = if (isPrepped) {
+val fieldsToExclude = 
HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.toArray()
+StructType(targetStructType.fields.filterNot(field => 
fieldsToExclude.contains(field.name)))
+  } else {
+targetStructType
+  }
   // NOTE: To make sure we properly transform records
-  val targetStructTypeRowWriter = 
getCachedUnsafeRowWriter(sourceStructType, targetStructType)
+  val finalStructTypeRowWriter = 
getCachedUnsafeRowWriter(sourceStructType, finalStructType)
 
   it.map { sourceRow =>
-val recordKey = sparkKeyGenerator.getRecordKey(sourceRow, 
sourceStructType)
-val partitionPath = 

[GitHub] [hudi] yihua commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583654629

   Hi @zhangyue19921010 @xiarixiaoyao @nsivabalan @xushiyan @danny0405, could 
you also review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223651454


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/command/index/TestIndexSyntax.scala:
##
@@ -56,30 +58,37 @@ class TestIndexSyntax extends HoodieSparkSqlTestBase {
 
 var logicalPlan = sqlParser.parsePlan(s"show indexes from 
default.$tableName")
 var resolvedLogicalPlan = analyzer.execute(logicalPlan)
-
assertResult(s"`default`.`$tableName`")(resolvedLogicalPlan.asInstanceOf[ShowIndexesCommand].table.identifier.quotedString)

Review Comment:
   FR: `table.identifier.quotedString` now also has catalog name as the prefix.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -733,8 +734,8 @@ object HoodieBaseRelation extends SparkAdapterSupport {
 
 partitionedFile => {
   val hadoopConf = hadoopConfBroadcast.value.get()
-  val reader = new HoodieAvroHFileReader(hadoopConf, new 
Path(partitionedFile.filePath),
-new CacheConfig(hadoopConf))
+  val filePath = 
sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(partitionedFile)

Review Comment:
   For Reviewer (FR): all the changes in the common module of introducing new 
adapter support are because of Spark 3.4 class and API changes.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormat.scala:
##
@@ -34,6 +34,15 @@ class HoodieParquetFileFormat extends ParquetFileFormat with 
SparkAdapterSupport
 
   override def toString: String = "Hoodie-Parquet"
 
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {

Review Comment:
   FR: Spark 3.4 now supports vectorized reader on nested fields.  However, 
Hudi does not support this yet due to custom schema evolution logic.  So we add 
logic to override `supportBatch` in `HoodieParquetFileFormat` for Spark 3.4.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583605229

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8840:
URL: https://github.com/apache/hudi/pull/8840#issuecomment-1583605102

   
   ## CI report:
   
   * 29e4627d6ec492fb19b64777fbc4ae8e2091d6e0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17685)
 
   * 80b25e613cbcdf8f3e1efe39436cad173163d9d9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mpouttu commented on issue #498: Is there any record delete examples?

2023-06-08 Thread via GitHub


mpouttu commented on issue #498:
URL: https://github.com/apache/hudi/issues/498#issuecomment-1583601589

   _hoodie_is_deleted allows us to delete records and replace them with new 
records in the same transaction which is essential for some of our use cases. 
EmptyHoodieRecordPayload forces us to do the deletes in a separate commit to 
the inserts which will cause the balances not to tie out for GL accounts for 
example. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8840:
URL: https://github.com/apache/hudi/pull/8840#issuecomment-1583596337

   
   ## CI report:
   
   * 3461a1e2fbcc7b51e06f4bf803b6753466396c95 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17471)
 
   * 29e4627d6ec492fb19b64777fbc4ae8e2091d6e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17685)
 
   * 80b25e613cbcdf8f3e1efe39436cad173163d9d9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8840:
URL: https://github.com/apache/hudi/pull/8840#issuecomment-1583582258

   
   ## CI report:
   
   * 3461a1e2fbcc7b51e06f4bf803b6753466396c95 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17471)
 
   * 29e4627d6ec492fb19b64777fbc4ae8e2091d6e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17685)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8840:
URL: https://github.com/apache/hudi/pull/8840#issuecomment-1583486543

   
   ## CI report:
   
   * 3461a1e2fbcc7b51e06f4bf803b6753466396c95 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17471)
 
   * 29e4627d6ec492fb19b64777fbc4ae8e2091d6e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583469483

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583463698

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats

2023-06-08 Thread via GitHub


yihua commented on code in PR #8840:
URL: https://github.com/apache/hudi/pull/8840#discussion_r1223597615


##
hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java:
##
@@ -20,41 +20,74 @@
 package org.apache.hudi.common.util;
 
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.util.Lazy;
 
 import com.fasterxml.jackson.annotation.JsonAutoDetect;
 import com.fasterxml.jackson.annotation.PropertyAccessor;
 import com.fasterxml.jackson.core.JsonProcessingException;
 import com.fasterxml.jackson.databind.DeserializationFeature;
 import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.SerializationFeature;
+import com.fasterxml.jackson.databind.util.StdDateFormat;
+import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
 
 /**
  * Utils for JSON serialization and deserialization.
  */
 public class JsonUtils {
 
-  private static final ObjectMapper MAPPER = new ObjectMapper();
-
-  static {
-MAPPER.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
-// We need to exclude custom getters, setters and creators which can use 
member fields
-// to derive new fields, so that they are not included in the serialization
-MAPPER.setVisibility(PropertyAccessor.FIELD, 
JsonAutoDetect.Visibility.ANY);
-MAPPER.setVisibility(PropertyAccessor.GETTER, 
JsonAutoDetect.Visibility.NONE);
-MAPPER.setVisibility(PropertyAccessor.IS_GETTER, 
JsonAutoDetect.Visibility.NONE);
-MAPPER.setVisibility(PropertyAccessor.SETTER, 
JsonAutoDetect.Visibility.NONE);
-MAPPER.setVisibility(PropertyAccessor.CREATOR, 
JsonAutoDetect.Visibility.NONE);
-  }
+  private static final Lazy MAPPER = 
Lazy.lazily(JsonUtils::instantiateObjectMapper);
 
   public static ObjectMapper getObjectMapper() {
-return MAPPER;
+return MAPPER.get();
   }
 
   public static String toString(Object value) {
 try {
-  return MAPPER.writeValueAsString(value);
+  return MAPPER.get().writeValueAsString(value);
 } catch (JsonProcessingException e) {
   throw new HoodieIOException(
   "Fail to convert the class: " + value.getClass().getName() + " to 
Json String", e);
 }
   }
+
+  private static ObjectMapper instantiateObjectMapper() {
+ObjectMapper mapper = new ObjectMapper();
+
+registerModules(mapper);
+
+// We're writing out dates as their string representations instead of 
(int) timestamps
+mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
+// NOTE: This is necessary to make sure that w/ Jackson >= 2.11 colon is 
not infixed
+//   into the timezone value ("+00:00" as opposed to "+" before 
2.11)
+//   While Jackson is able to parse both of these formats, we keep it 
as false
+//   to make sure metadata produced by Hudi stays consistent across 
Jackson versions
+configureColonInTimezone(mapper);

Review Comment:
   I think we serialize the column stats to the metadata record payload, 
correct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex opened a new pull request, #8909: [HUDI-6311] Insert Into updated behavior

2023-06-08 Thread via GitHub


jonvex opened a new pull request, #8909:
URL: https://github.com/apache/hudi/pull/8909

   ### Change Logs
   
   Insert into updated for new behavior
   Insert overwrite updated for current behavior 
https://issues.apache.org/jira/browse/HUDI-6021
   Create table updated for pkless
   ### Impact
   
   Website change
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223580325


##
hudi-spark-datasource/hudi-spark3.2.x/src/main/scala/org/apache/spark/sql/HoodieSpark32CatalystPlanUtils.scala:
##
@@ -38,6 +36,14 @@ object HoodieSpark32CatalystPlanUtils extends 
HoodieSpark3CatalystPlanUtils {
   case _ => None
 }
 
+  override def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, 
LogicalPlan, Expression)] = {
+plan match {
+  case MergeIntoTable(targetTable, sourceTable, mergeCondition, _, _) =>
+Some((targetTable, sourceTable, mergeCondition))

Review Comment:
   The inner pair of parentheses is for Scala tuple, which is required.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223579632


##
hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java:
##
@@ -35,6 +36,8 @@ public class JsonUtils {
   private static final ObjectMapper MAPPER = new ObjectMapper();
 
   static {
+registerModules(MAPPER);
+

Review Comment:
   #8840 contains some minor improvements which I'm currently not inclined to 
include.  I'll revise #8840 to contain necessary fixes and then land it before 
this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] pthalasta commented on issue #8901: [SUPPORT] Spark job never terminates

2023-06-08 Thread via GitHub


pthalasta commented on issue #8901:
URL: https://github.com/apache/hudi/issues/8901#issuecomment-1583381245

   I was able to add some env variable as mentioned in the warning message, 
however, the job never terminates and these are the last few lines of the logs 
that i see
   
   ```
   23/06/08 13:54:10 INFO ClusteringUtils: Found 0 files in pending clustering 
operations
   23/06/08 13:54:10 INFO AbstractTableFileSystemView: Building file system 
view for partition (files)
   23/06/08 13:54:11 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0
   ```
   
   Can someone help me with this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


CTTY commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223566158


##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark34HoodieParquetFileFormat.scala:
##
@@ -0,0 +1,532 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.mapred.FileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.hudi.HoodieSparkUtils
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetInputFormat, ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{Cast, JoinedRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.WholeStageCodegenExec
+import 
org.apache.spark.sql.execution.datasources.parquet.Spark34HoodieParquetFileFormat._
+import org.apache.spark.sql.execution.datasources.{DataSourceUtils, 
PartitionedFile, RecordReaderIterator}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types.{AtomicType, DataType, StructField, 
StructType}
+import org.apache.spark.util.SerializableConfiguration
+/**
+ * This class is an extension of [[ParquetFileFormat]] overriding 
Spark-specific behavior
+ * that's not possible to customize in any other way
+ *
+ * NOTE: This is a version of [[AvroDeserializer]] impl from Spark 3.2.1 w/ w/ 
the following changes applied to it:
+ * 
+ *   Avoiding appending partition values to the rows read from the data 
file
+ *   Schema on-read
+ * 
+ */
+class Spark34HoodieParquetFileFormat(private val shouldAppendPartitionValues: 
Boolean) extends ParquetFileFormat {
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+val conf = sparkSession.sessionState.conf
+conf.parquetVectorizedReaderEnabled && 
schema.forall(_.dataType.isInstanceOf[AtomicType])
+  }
+
+  def supportsColumnar(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+val conf = sparkSession.sessionState.conf
+// Only output columnar if there is WSCG to read it.
+val requiredWholeStageCodegenSettings =
+  conf.wholeStageEnabled && !WholeStageCodegenExec.isTooManyFields(conf, 
schema)
+requiredWholeStageCodegenSettings &&
+  supportBatch(sparkSession, schema)
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): 
PartitionedFile => Iterator[InternalRow] = {
+hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, 
classOf[ParquetReadSupport].getName)
+hadoopConf.set(
+  ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA,
+  requiredSchema.json)
+hadoopConf.set(
+  

[GitHub] [hudi] hudi-bot commented on pull request #8907: [DNM][MINOR] Add some logs to investigate flaky testUpsertsContinuousModeWithMultipleWriters

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8907:
URL: https://github.com/apache/hudi/pull/8907#issuecomment-1583366621

   
   ## CI report:
   
   * ed947b39f1c42f690cbb79257399c1ec967859e9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17682)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


CTTY commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223562373


##
hudi-spark-datasource/hudi-spark3.2.x/src/main/scala/org/apache/spark/sql/HoodieSpark32CatalystPlanUtils.scala:
##
@@ -38,6 +36,14 @@ object HoodieSpark32CatalystPlanUtils extends 
HoodieSpark3CatalystPlanUtils {
   case _ => None
 }
 
+  override def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, 
LogicalPlan, Expression)] = {
+plan match {
+  case MergeIntoTable(targetTable, sourceTable, mergeCondition, _, _) =>
+Some((targetTable, sourceTable, mergeCondition))

Review Comment:
   nit: double parentheses



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


CTTY commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1223554380


##
hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java:
##
@@ -35,6 +36,8 @@ public class JsonUtils {
   private static final ObjectMapper MAPPER = new ObjectMapper();
 
   static {
+registerModules(MAPPER);
+

Review Comment:
   This change is ported from #8840 . I assume we will need to merge that PR 
before this one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan merged pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-06-08 Thread via GitHub


nsivabalan merged PR #8684:
URL: https://github.com/apache/hudi/pull/8684


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-06-08 Thread via GitHub


nsivabalan commented on PR #8684:
URL: https://github.com/apache/hudi/pull/8684#issuecomment-1583315765

   CI is green
   
   https://github.com/apache/hudi/assets/513218/f9ba5e86-bc6f-4b62-a14c-502151f2a188;>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8885: [HUDI-6198] Support Hudi on Spark 3.4.0

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8885:
URL: https://github.com/apache/hudi/pull/8885#issuecomment-1583301619

   
   ## CI report:
   
   * e2f44f2a1f574eed79090b337d7bd56e08058b51 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8908: [DNM][MINOR] Add some logs to investigate flaky testUpsertsContinuousModeWithMultipleWriters

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8908:
URL: https://github.com/apache/hudi/pull/8908#issuecomment-1583220401

   
   ## CI report:
   
   * 9d6633418e12c8a06c7bdb3e271f535096299bd2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17683)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8847:
URL: https://github.com/apache/hudi/pull/8847#issuecomment-1583210976

   
   ## CI report:
   
   * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN
   * 818c8050bf6cab30a402bfeab83a473976c44cdd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17679)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8905: [HUDI-6337] Incremental Clean ignore partitions affected by append write commits/delta commits

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8905:
URL: https://github.com/apache/hudi/pull/8905#issuecomment-1583202447

   
   ## CI report:
   
   * f8f14263190df7b66143e192188e68463e0c1efd UNKNOWN
   * f9adcecf4e54774510569f14af4c81a1f4951a28 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17681)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table

2023-06-08 Thread via GitHub


hudi-bot commented on PR #8900:
URL: https://github.com/apache/hudi/pull/8900#issuecomment-1583202392

   
   ## CI report:
   
   * fe74a9a7d32286ae29ded9370f6d53ccb14c8809 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17677)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



  1   2   >