[GitHub] [hudi] YannByron commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table with null values

2023-06-26 Thread via GitHub


YannByron commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1608757129

   @stp-pv Nice fix.
   Could `map(field.name) = record.get(idx, field.dataType)` hit the same problem? Can you also mock up a case to test it, and then fix both together? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 closed pull request #9047: [Hudi 6422] Solve the issues of compiling dependency on Hadoop 3.1.1

2023-06-26 Thread via GitHub


zhuanshenbsj1 closed pull request #9047: [Hudi 6422] Solve the issues of 
compiling dependency on Hadoop 3.1.1
URL: https://github.com/apache/hudi/pull/9047





[GitHub] [hudi] danny0405 commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1243097851


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) {
    *
    * @param partitionPath Partition path
    */
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-    WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-        .create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) {
+    WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+    if (!writeMarkers.doesMarkerDirExist()) {
+      throw new HoodieIOException(String.format("Marker root directory absent : %s/%s (%s)",
+          partitionPath, dataFileName, markerInstantTime));
+    }
+    if (config.enforceFinalizeWriteCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", "FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) {
+      throw new HoodieCorruptedDataException("Reconciliation for instant " + instantTime + " is completed, job is trying to re-write the data files.");
+    }
+    if (config.enforceCompletionMarkerCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath(partitionPath, fileId, markerInstantTime, getIOType()))) {
+      throw new HoodieIOException("Completed marker file exists for : " + dataFileName + " (" + instantTime + ")");
+    }
+    writeMarkers.create(partitionPath, dataFileName, getIOType());
+  }
+
+  // visible for testing
+  public void createCompletedMarkerFile(String partition, String markerInstantTime) throws IOException {
+    try {
+      WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
+          .createCompletionMarker(partition, fileId, markerInstantTime, getIOType(), true);
+    } catch (Exception e) {
+      // Clean up the data file, if the marker is already present or marker directories don't exist.
+      Path partitionPath = FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(), partition);
Review Comment:
   The literals are hard to read clearly; I will try to understand the workflow first.






[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9049:
URL: https://github.com/apache/hudi/pull/9049#issuecomment-1608702766

   
   ## CI report:
   
   * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
 
   * 04ef037a6fa3652fa98638c2442e4081c327dae9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18118)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1243093277


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("File Id Prefix provider class, that implements `org.apache.hudi.fileid.FileIdPrefixProvider`");
 
+  public static final ConfigProperty<String> ENFORCE_COMPLETION_MARKER_CHECKS = ConfigProperty
+      .key("hoodie.markers.enforce.completion.checks")
+      .defaultValue("false")
+      .sinceVersion("0.10.0")
+      .withDocumentation("Prevents the creation of duplicate data files, when multiple spark tasks are racing to "
+          + "create data files and a completed data file is already present");
+
+  public static final ConfigProperty<String> ENFORCE_FINALIZE_WRITE_CHECK = ConfigProperty
+      .key("hoodie.markers.enforce.finalize.write.check")
+      .defaultValue("false")
+      .sinceVersion("0.10.0")
+      .withDocumentation("When WriteStatus obj is lost due to engine related failures, then recomputing would involve "
+          + "re-writing all the data files. When this check is enabled it would block the rewrite from happening.");

Review Comment:
   > if writeStatus RDD blocks are found to be missing, execution engine (spark) would re-trigger the write stage (to recreate the write statuses).
   
   It seems like a Spark-engine-specific issue, but here we put the fix in the writer code, which could affect all the engines. May I know why the writeStatus RDD blocks could be missing here? Can we persist them before committing to the MDT?
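   As a side note for anyone trying these flags out, a hedged sketch of how a writer could opt into the proposed checks, assuming the two keys land with the names shown in the diff above (the PR is still under review, so treat the keys as tentative):

   ```java
   import java.util.Properties;

   import org.apache.hudi.config.HoodieWriteConfig;

   public class MarkerCheckConfigDemo {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Keys taken from the diff above; both default to "false" there.
       props.setProperty("hoodie.markers.enforce.completion.checks", "true");
       props.setProperty("hoodie.markers.enforce.finalize.write.check", "true");
       HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
           .withPath("/tmp/hudi_demo_table") // hypothetical base path
           .withProperties(props)
           .build();
       System.out.println(config.getProps().getProperty("hoodie.markers.enforce.completion.checks"));
     }
   }
   ```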






[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9057:
URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608689486

   
   ## CI report:
   
   * aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18116)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9049:
URL: https://github.com/apache/hudi/pull/9049#issuecomment-1608689384

   
   ## CI report:
   
   * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
 
   * 04ef037a6fa3652fa98638c2442e4081c327dae9 UNKNOWN
   
   





[jira] [Updated] (HUDI-5884) Support bulk_insert for insert_overwrite and insert_overwrite_table

2023-06-26 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-5884:
-
Fix Version/s: 0.14.0

> Support bulk_insert for insert_overwrite and insert_overwrite_table
> ---
>
> Key: HUDI-5884
> URL: https://issues.apache.org/jira/browse/HUDI-5884
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5692) SpillableMapBasePath should be lazily loaded

2023-06-26 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-5692:
-
Fix Version/s: 0.14.0
   (was: 0.13.0)

> SpillableMapBasePath should be lazily loaded
> 
>
> Key: HUDI-5692
> URL: https://issues.apache.org/jira/browse/HUDI-5692
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> If we use {{withInferFunction}} to set the default value of
> {{SPILLABLE_MAP_BASE_PATH}}, this default value is stored in
> {{HoodieWriteConfig}}'s {{properties}} and serialized to all executors. This
> causes a problem when the driver does not have the same temporary location
> as the executors (e.g. driver: /mnt/disk1, executor: /mnt/disk2): the
> executor fails to create the spillable map path, since the executor machine
> does not have the directory /mnt/disk1.
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:119)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMapNumEntries(ExternalSpillableMap.java:138)
>   at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:268)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:129)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:121)
>   at org.apache.hudi.io.HoodieConcatHandle.(HoodieConcatHandle.java:81)
>   at 
> org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:60)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:386)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:363)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:330)
>   ... 29 more
> Caused by: java.io.IOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at org.apache.hudi.common.util.FileIOUtils.mkdir(FileIOUtils.java:70)
>   at org.apache.hudi.common.util.collection.DiskMap.(DiskMap.java:55)
>   at 
> org.apache.hudi.common.util.collection.BitCaskDiskMap.(BitCaskDiskMap.java:98)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:116)
>   ... 38 more
>  
> {code}
> A better solution is to compute the temporary location lazily, when
> {{getSpillableMapBasePath}} is called.
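A self-contained sketch of the lazy resolution proposed above (all names and
the double-checked locking are illustrative, not the actual fix): the base
path is computed on the JVM that first spills, so the driver's temp dir is
never baked into the serialized write config.

{code:java}
import java.nio.file.Files;

public class LazySpillPathDemo {
  // Resolved on first use, per JVM, instead of being precomputed on the
  // driver and shipped to executors inside the serialized config.
  private static volatile String basePath;

  static String getSpillableMapBasePath() throws Exception {
    if (basePath == null) {
      synchronized (LazySpillPathDemo.class) {
        if (basePath == null) {
          basePath = Files.createTempDirectory("hudi-spill-").toString();
        }
      }
    }
    return basePath;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(getSpillableMapBasePath());
  }
}
{code}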





[GitHub] [hudi] xuzifu666 closed pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


xuzifu666 closed pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto 
should support mor
URL: https://github.com/apache/hudi/pull/9054





[GitHub] [hudi] xuzifu666 commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


xuzifu666 commented on PR #9054:
URL: https://github.com/apache/hudi/pull/9054#issuecomment-1608673153

   Currently partial update is not supported in MERGE INTO, so I am closing this PR.





[jira] [Updated] (HUDI-5692) SpillableMapBasePath should be lazily loaded

2023-06-26 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-5692:
-
Fix Version/s: 0.13.0

> SpillableMapBasePath should be lazily loaded
> 
>
> Key: HUDI-5692
> URL: https://issues.apache.org/jira/browse/HUDI-5692
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If we use {{withInferFunction}} to set the default value of
> {{SPILLABLE_MAP_BASE_PATH}}, this default value is stored in
> {{HoodieWriteConfig}}'s {{properties}} and serialized to all executors. This
> causes a problem when the driver does not have the same temporary location
> as the executors (e.g. driver: /mnt/disk1, executor: /mnt/disk2): the
> executor fails to create the spillable map path, since the executor machine
> does not have the directory /mnt/disk1.
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:119)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMapNumEntries(ExternalSpillableMap.java:138)
>   at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:268)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:129)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:121)
>   at org.apache.hudi.io.HoodieConcatHandle.(HoodieConcatHandle.java:81)
>   at 
> org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:60)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:386)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:363)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:330)
>   ... 29 more
> Caused by: java.io.IOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at org.apache.hudi.common.util.FileIOUtils.mkdir(FileIOUtils.java:70)
>   at org.apache.hudi.common.util.collection.DiskMap.(DiskMap.java:55)
>   at 
> org.apache.hudi.common.util.collection.BitCaskDiskMap.(BitCaskDiskMap.java:98)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:116)
>   ... 38 more
>  
> {code}
> A better solution is to compute the temporary location lazily, when
> {{getSpillableMapBasePath}} is called.





[GitHub] [hudi] xuzifu666 commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


xuzifu666 commented on PR #9052:
URL: https://github.com/apache/hudi/pull/9052#issuecomment-1608665812

   > > it seems to be the same problem which should be fixed by a previous PR. 
May wait for further feedback.
   > 
   > Thanks so much for the help.
   
   I will close the PR for now.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1243081122


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) {
    */
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) {
-    if (enabled && metadata != null) {
-      // Is this rollback of an instant that has been synced to the metadata table?
-      String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-      boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-      if (!wasSynced) {
-        // A compaction may have taken place on metadata table which would have included this instant being rolled back.
-        // Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458
-        Option<String> latestCompaction = metadata.getLatestCompactionTime();
-        if (latestCompaction.isPresent()) {
-          wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-        }
+    // The commit which is being rolled back on the dataset
+    final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0);
+    // Find the deltacommits since the last compaction
+    Option<Pair<HoodieTimeline, HoodieInstant>> deltaCommitsInfo =
+        CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+    if (!deltaCommitsInfo.isPresent()) {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime));
+      return;
+    }
+
+    // This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+    HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+    HoodieTimeline deltacommitsSinceCompaction = deltaCommitsInfo.get().getKey();
+
+    // The deltacommit that will be rolled back
+    HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+    // The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions
+    // are completed on the dataset. Hence, this case implies a rollback of completed commit which should actually be handled using restore.
+    if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+      final String compactionInstantTime = compactionInstant.getTimestamp();
+      if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, compactionInstantTime)) {
+        throw new HoodieMetadataException(String.format("Commit being rolled back %s is older than the latest compaction %s. "
+            + "There are %d deltacommits after this compaction: %s", commitInstantTime, compactionInstantTime,
+            deltacommitsSinceCompaction.countInstants(), deltacommitsSinceCompaction.getInstants()));
       }
+    }
 
-      Map<MetadataPartitionType, HoodieData<HoodieRecord>> records =
-          HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
-              rollbackMetadata, getRecordsGenerationParams(), instantTime,
-              metadata.getSyncedInstantTime(), wasSynced);
-      commit(instantTime, records, false);
-      closeInternal();
+    if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {
+      LOG.info("Rolling back MDT deltacommit " + commitInstantTime);
+      if (!getWriteClient().rollback(commitInstantTime, instantTime)) {
+        throw new HoodieMetadataException("Failed to rollback deltacommit at " + commitInstantTime);
+      }
+    } else {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no corresponding deltacommits on MDT",
+          commitInstantTime, instantTime));
     }
+
+    // Rollback of MOR table may end up adding a new log file. So we need to check for added files and add them to MDT
+    processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
+        rollbackMetadata, getRecordsGenerationParams(), instantTime,
+        metadata.getSyncedInstantTime(), true), false);

Review Comment:
   Just wanted to double-confirm: in the list of valid instants we populate while reading the MDT using the Log Record Reader, we do include rollback instants from the DT, right? How might this pan out if an async compaction from the DT is rolled back multiple times and then finally gets committed?
   
   ```
   public static Set<String> getValidInstantTimestamps(HoodieTableMetaClient dataMetaClient,

[GitHub] [hudi] xuzifu666 closed pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


xuzifu666 closed pull request #9052: [HUDI-6439] DirectWriteMarkers create file 
need judge appendfile whether exist
URL: https://github.com/apache/hudi/pull/9052





[jira] [Resolved] (HUDI-5692) SpillableMapBasePath should be lazily loaded

2023-06-26 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An resolved HUDI-5692.
--

> SpillableMapBasePath should be lazily loaded
> 
>
> Key: HUDI-5692
> URL: https://issues.apache.org/jira/browse/HUDI-5692
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
>
> If we use {{withInferFunction}} to set the default value of
> {{SPILLABLE_MAP_BASE_PATH}}, this default value is stored in
> {{HoodieWriteConfig}}'s {{properties}} and serialized to all executors. This
> causes a problem when the driver does not have the same temporary location
> as the executors (e.g. driver: /mnt/disk1, executor: /mnt/disk2): the
> executor fails to create the spillable map path, since the executor machine
> does not have the directory /mnt/disk1.
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:119)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMapNumEntries(ExternalSpillableMap.java:138)
>   at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:268)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:129)
>   at org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:121)
>   at org.apache.hudi.io.HoodieConcatHandle.(HoodieConcatHandle.java:81)
>   at 
> org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:60)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:386)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:363)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:330)
>   ... 29 more
> Caused by: java.io.IOException: Unable to create 
> :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at org.apache.hudi.common.util.FileIOUtils.mkdir(FileIOUtils.java:70)
>   at org.apache.hudi.common.util.collection.DiskMap.(DiskMap.java:55)
>   at 
> org.apache.hudi.common.util.collection.BitCaskDiskMap.(BitCaskDiskMap.java:98)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:116)
>   ... 38 more
>  
> {code}
> A better solution is to compute the temporary location lazily, when
> {{getSpillableMapBasePath}} is called.





[GitHub] [hudi] danny0405 commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


danny0405 commented on PR #9052:
URL: https://github.com/apache/hudi/pull/9052#issuecomment-1608659216

   > it seems to be the same problem which should be fixed by a previous PR. 
May wait for further feedback.
   
   Thanks so much for the help.





[GitHub] [hudi] danny0405 commented on a diff in pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9057:
URL: https://github.com/apache/hudi/pull/9057#discussion_r1243067339


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -344,6 +344,13 @@ private boolean initializeFromFilesystem(String initializationTime, List

[GitHub] [hudi] danny0405 commented on a diff in pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9049:
URL: https://github.com/apache/hudi/pull/9049#discussion_r1243054136


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClientIds.java:
##
@@ -167,6 +167,7 @@ private void updateHeartbeat(Path heartbeatFilePath) throws HoodieHeartbeatException {
       this.fs.create(heartbeatFilePath, true);
       outputStream.close();
     } catch (IOException io) {
+      LOG.error("Unable to generate heartbeat,heartbeatFilePath:{}", heartbeatFilePath, io);
       throw new HoodieHeartbeatException("Unable to generate heartbeat ", io);
     }

Review Comment:
   We can remove the log: `LOG.error("Unable to generate heartbeat,heartbeatFilePath:{}", heartbeatFilePath, io);`
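   One plausible rationale (my reading, not stated in the thread): the IOException is chained as the cause of the thrown HoodieHeartbeatException, so whoever catches and logs the exception already gets the full cause chain and stack trace; an extra LOG.error at the throw site duplicates it. A minimal sketch:

   ```java
   import java.io.IOException;

   import org.apache.hudi.exception.HoodieHeartbeatException;

   public class HeartbeatRethrowDemo {
     static void updateHeartbeat() {
       try {
         throw new IOException("simulated fs.create failure"); // stand-in for the real I/O error
       } catch (IOException io) {
         // The cause is chained, so logging here as well would double-report it.
         throw new HoodieHeartbeatException("Unable to generate heartbeat ", io);
       }
     }

     public static void main(String[] args) {
       try {
         updateHeartbeat();
       } catch (HoodieHeartbeatException e) {
         e.printStackTrace(); // includes "Caused by: java.io.IOException: ..."
       }
     }
   }
   ```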






[GitHub] [hudi] danny0405 commented on a diff in pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9049:
URL: https://github.com/apache/hudi/pull/9049#discussion_r1243053950


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:
##
@@ -262,6 +262,7 @@ private void updateHeartbeat(String instantTime) throws HoodieHeartbeatException {
       heartbeat.setLastHeartbeatTime(newHeartbeatTime);
       heartbeat.setNumHeartbeats(heartbeat.getNumHeartbeats() + 1);
     } catch (IOException io) {
+      LOG.error("Unable to generate heartbeat,instant:{}", instantTime, io);
       throw new HoodieHeartbeatException("Unable to generate heartbeat ", io);
     }

Review Comment:
   We can remove the log: `LOG.error("Unable to generate heartbeat,instant:{}", instantTime, io);`






[jira] [Updated] (HUDI-5303) Allow users to control the concurrency to submit jobs in clustering

2023-06-26 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-5303:
-
Fix Version/s: 0.14.0

> Allow users to control the concurrency to submit jobs in clustering
> ---
>
> Key: HUDI-5303
> URL: https://issues.apache.org/jira/browse/HUDI-5303
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, spark
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> If there are sufficient resources for the clustering job, some clustering
> groups can still end up waiting to be triggered. We currently use the common
> ForkJoinPool to submit these jobs, which is difficult for clients to tune
> (--conf spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism)
> and can also affect other tasks sharing the ForkJoinPool. Instead, we
> introduce a dedicated thread pool to control the job-submission parallelism
> for clustering.
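An illustrative sketch of the idea (not Hudi's actual code): submit each
clustering group on a dedicated bounded pool, so the concurrency is tunable
per job rather than shared through the JVM-wide common ForkJoinPool.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class BoundedSubmitDemo {
  public static void main(String[] args) {
    List<String> groups = Arrays.asList("group-1", "group-2", "group-3", "group-4");
    // 15 mirrors the default of the new hoodie.clustering.max.parallelism config.
    ExecutorService pool = Executors.newFixedThreadPool(Math.min(15, groups.size()));
    List<CompletableFuture<String>> futures = groups.stream()
        .map(g -> CompletableFuture.supplyAsync(() -> "clustered " + g, pool))
        .collect(Collectors.toList());
    futures.forEach(f -> System.out.println(f.join()));
    pool.shutdown();
  }
}
{code}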





[jira] [Closed] (HUDI-5303) Allow users to control the concurrency to submit jobs in clustering

2023-06-26 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5303.

Resolution: Fixed

Fixed via master branch: 8eafe17a6a276b1384d2e4b528fd0abdf190bd84

> Allow users to control the concurrency to submit jobs in clustering
> ---
>
> Key: HUDI-5303
> URL: https://issues.apache.org/jira/browse/HUDI-5303
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, spark
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> If there are sufficient resources for the clustering job, some clustering
> groups can still end up waiting to be triggered. We currently use the common
> ForkJoinPool to submit these jobs, which is difficult for clients to tune
> (--conf spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism)
> and can also affect other tasks sharing the ForkJoinPool. Instead, we
> introduce a dedicated thread pool to control the job-submission parallelism
> for clustering.





[hudi] branch master updated: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)

2023-06-26 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8eafe17a6a2 [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)
8eafe17a6a2 is described below

commit 8eafe17a6a276b1384d2e4b528fd0abdf190bd84
Author: Rex(Hui) An 
AuthorDate: Tue Jun 27 09:56:25 2023 +0800

    [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)
---
 .../apache/hudi/config/HoodieClusteringConfig.java |  9 +++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  4 ++
 .../MultipleSparkJobExecutionStrategy.java | 66 +-
 3 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
index cafed2febc6..e9ff847a6f0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
@@ -156,6 +156,15 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .sinceVersion("0.9.0")
       .withDocumentation("Config to control frequency of async clustering");
 
+  public static final ConfigProperty<Integer> CLUSTERING_MAX_PARALLELISM = ConfigProperty
+      .key("hoodie.clustering.max.parallelism")
+      .defaultValue(15)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Maximum number of parallelism jobs submitted in clustering operation. "
+          + "If the resource is sufficient(Like Spark engine has enough idle executors), increasing this "
+          + "value will let the clustering job run faster, while it will give additional pressure to the "
+          + "execution engines to manage more concurrent running jobs.");
+
   public static final ConfigProperty<String> PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST = ConfigProperty
       .key(CLUSTERING_STRATEGY_PARAM_PREFIX + "daybased.skipfromlatest.partitions")
       .defaultValue("0")
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index eba9728777f..7b672abf241 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -1634,6 +1634,10 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getString(HoodieClusteringConfig.PLAN_STRATEGY_CLASS_NAME);
   }
 
+  public int getClusteringMaxParallelism() {
+    return getInt(HoodieClusteringConfig.CLUSTERING_MAX_PARALLELISM);
+  }
+
   public ClusteringPlanPartitionFilterMode getClusteringPlanPartitionFilterMode() {
     String mode = getString(HoodieClusteringConfig.PLAN_PARTITION_FILTER_MODE_NAME);
     return ClusteringPlanPartitionFilterMode.valueOf(mode);
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
index 540da42fd78..c6a1df9105e 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
@@ -36,6 +36,7 @@ import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
 import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.common.util.CustomizedThreadFactory;
 import org.apache.hudi.common.util.FutureUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
@@ -82,6 +83,8 @@ import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
@@ -105,30 +108,39 @@ public abstract class MultipleSparkJobExecutionStrategy
   public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final HoodieClusteringPlan clusteringPlan, final Schema schema, final String instantTime) {
     JavaSparkContext engineContext = HoodieSparkEngineContext.getSparkContext(getEngineContext());
     boolean shouldPreserveMetadata = 
[GitHub] [hudi] danny0405 merged pull request #7343: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering

2023-06-26 Thread via GitHub


danny0405 merged PR #7343:
URL: https://github.com/apache/hudi/pull/7343





[GitHub] [hudi] Coco0201 commented on issue #8371: [SUPPORT] Flink cant read metafield '_hoodie_commit_time'

2023-06-26 Thread via GitHub


Coco0201 commented on issue #8371:
URL: https://github.com/apache/hudi/issues/8371#issuecomment-1608574627

   > Did you declare the `_hoodie_commit_time` as a schema field in your table?
   
   I found that a comma was missing in the DDL of my Flink table. After fixing it, there is no problem reading the meta fields.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #9037:
URL: https://github.com/apache/hudi/pull/9037#discussion_r1243027563


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -195,11 +193,6 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> keys,
         blockContentLoc.getContentPositionInLogFile(),
         blockContentLoc.getBlockSize());
 
-    // HFile read will be efficient if keys are sorted, since on storage records are sorted by key.

Review Comment:
   Sure, we can fix that.






[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.

2023-06-26 Thread via GitHub


hudi-bot commented on PR #8609:
URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608548374

   
   ## CI report:
   
   * e14bd41edf6cc961d77087eea67f755f23590834 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9057:
URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608518499

   
   ## CI report:
   
   * aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18116)
 
   
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #9041:
URL: https://github.com/apache/hudi/pull/9041#discussion_r1242967907


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -310,6 +312,56 @@ public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
         return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator();
       }
     });
-    return taggedUpdatingRecords.union(newRecords);
+    return taggedUpdatingRecords.union(taggedNewRecords);
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> tagGlobalLocationBackToRecords(
+      HoodieData<HoodieRecord<R>> incomingRecords,
+      HoodiePairData<String, HoodieRecordGlobalLocation> keyAndExistingLocations,
+      boolean mayContainDuplicateLookup,
+      boolean shouldUpdatePartitionPath,
+      HoodieWriteConfig config,
+      HoodieTable table) {
+    final HoodieRecordMerger merger = config.getRecordMerger();
+
+    HoodiePairData<String, HoodieRecord<R>> keyAndIncomingRecords =
+        incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record));
+
+    // Pair of incoming record and the global location if meant for merged lookup in later stage
+    HoodieData<Pair<HoodieRecord<R>, Option<HoodieRecordGlobalLocation>>> incomingRecordsAndLocations
+        = keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values()
+        .map(v -> {
+          final HoodieRecord<R> incomingRecord = v.getLeft();
+          Option<HoodieRecordGlobalLocation> currentLocOpt = Option.ofNullable(v.getRight().orElse(null));
+          if (currentLocOpt.isPresent()) {
+            HoodieRecordGlobalLocation currentLoc = currentLocOpt.get();
+            boolean shouldPerformMergedLookUp = mayContainDuplicateLookup
+                || !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath());
+            if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) {
+              return Pair.of(incomingRecord, currentLocOpt);
+            } else {
+              // - When update partition path is set to false,
+              //   the incoming record will be tagged to the existing record's partition regardless of being equal or not.
+              // - When update partition path is set to true,
+              //   the incoming record will be tagged to the existing record's partition
+              //   when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI).
+              return Pair.of((HoodieRecord) getTaggedRecord(

Review Comment:
   Got it.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -310,6 +312,56 @@ public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
         return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator();
       }
     });
-    return taggedUpdatingRecords.union(newRecords);
+    return taggedUpdatingRecords.union(taggedNewRecords);
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> tagGlobalLocationBackToRecords(
+      HoodieData<HoodieRecord<R>> incomingRecords,
+      HoodiePairData<String, HoodieRecordGlobalLocation> keyAndExistingLocations,
+      boolean mayContainDuplicateLookup,
+      boolean shouldUpdatePartitionPath,
+      HoodieWriteConfig config,
+      HoodieTable table) {
+    final HoodieRecordMerger merger = config.getRecordMerger();
+
+    HoodiePairData<String, HoodieRecord<R>> keyAndIncomingRecords =
+        incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record));
+
+    // Pair of incoming record and the global location if meant for merged lookup in later stage
+    HoodieData<Pair<HoodieRecord<R>, Option<HoodieRecordGlobalLocation>>> incomingRecordsAndLocations
+        = keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values()
+        .map(v -> {
+          final HoodieRecord<R> incomingRecord = v.getLeft();
+          Option<HoodieRecordGlobalLocation> currentLocOpt = Option.ofNullable(v.getRight().orElse(null));
+          if (currentLocOpt.isPresent()) {
+            HoodieRecordGlobalLocation currentLoc = currentLocOpt.get();
+            boolean shouldPerformMergedLookUp = mayContainDuplicateLookup
+                || !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath());
+            if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) {
+              return Pair.of(incomingRecord, currentLocOpt);
+            } else {
+              // - When update partition path is set to false,
+              //   the incoming record will be tagged to the existing record's partition regardless of being equal or not.
+              // - When update partition path is set to true,
+              //   the incoming record will be tagged to the existing record's partition
+              //   when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI).
+              return Pair.of((HoodieRecord) getTaggedRecord(
+                  createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)),
+                  Option.empty());
+            }
+          } else {
+            return Pair.of(getTaggedRecord(incomingRecord, Option.empty()), 

[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9057:
URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608500681

   
   ## CI report:
   
   * aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 UNKNOWN
   
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1242961967


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) {
    */
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) {
-    if (enabled && metadata != null) {
-      // Is this rollback of an instant that has been synced to the metadata table?
-      String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-      boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-      if (!wasSynced) {
-        // A compaction may have taken place on metadata table which would have included this instant being rolled back.
-        // Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458
-        Option<String> latestCompaction = metadata.getLatestCompactionTime();
-        if (latestCompaction.isPresent()) {
-          wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-        }
+    // The commit which is being rolled back on the dataset
+    final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0);
+    // Find the deltacommits since the last compaction
+    Option<Pair<HoodieTimeline, HoodieInstant>> deltaCommitsInfo =
+        CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+    if (!deltaCommitsInfo.isPresent()) {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime));
+      return;
+    }
+
+    // This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+    HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+    HoodieTimeline deltacommitsSinceCompaction = deltaCommitsInfo.get().getKey();
+
+    // The deltacommit that will be rolled back
+    HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+    // The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions
+    // are completed on the dataset. Hence, this case implies a rollback of completed commit which should actually be handled using restore.
+    if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+      final String compactionInstantTime = compactionInstant.getTimestamp();
+      if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, compactionInstantTime)) {
+        throw new HoodieMetadataException(String.format("Commit being rolled back %s is older than the latest compaction %s. "
+            + "There are %d deltacommits after this compaction: %s", commitInstantTime, compactionInstantTime,
+            deltacommitsSinceCompaction.countInstants(), deltacommitsSinceCompaction.getInstants()));
       }
+    }
 
-      Map<MetadataPartitionType, HoodieData<HoodieRecord>> records =
-          HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
-              rollbackMetadata, getRecordsGenerationParams(), instantTime,
-              metadata.getSyncedInstantTime(), wasSynced);
-      commit(instantTime, records, false);
-      closeInternal();
+    if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {
+      LOG.info("Rolling back MDT deltacommit " + commitInstantTime);
+      if (!getWriteClient().rollback(commitInstantTime, instantTime)) {
+        throw new HoodieMetadataException("Failed to rollback deltacommit at " + commitInstantTime);
+      }
+    } else {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no corresponding deltacommits on MDT",
+          commitInstantTime, instantTime));
     }
+
+    // Rollback of MOR table may end up adding a new log file. So we need to check for added files and add them to MDT
+    processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
+        rollbackMetadata, getRecordsGenerationParams(), instantTime,
+        metadata.getSyncedInstantTime(), true), false);

Review Comment:
   I get it: for a MOR data table, rollback will add a new log file in the DT, so we need this to track the newly added file. But can we optimize it so that this is triggered only for MOR tables, or only when there are files to be added?
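   A sketch of the kind of guard I mean (illustrative only, not this PR's code; MOR is the case where rollback can append new log files that the MDT must track):

   ```java
   import org.apache.hudi.common.model.HoodieTableType;
   import org.apache.hudi.common.table.HoodieTableMetaClient;

   public class RollbackTrackingGuard {
     // Only worth re-scanning for rollback-added files on MOR tables.
     static boolean shouldTrackRollbackAddedFiles(HoodieTableMetaClient metaClient) {
       return metaClient.getTableType() == HoodieTableType.MERGE_ON_READ;
     }
   }
   ```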




[jira] [Updated] (HUDI-6446) Defer Initialization of MDT just at the end of first commit

2023-06-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6446:
-
Labels: pull-request-available  (was: )

> Defer Initialization of MDT just at the end of first commit 
> 
>
> Key: HUDI-6446
> URL: https://issues.apache.org/jira/browse/HUDI-6446
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> For a fresh table, when both FILES and RLI are enabled, we use default values
> for the number of file groups, i.e. 10 for RLI. This also creates a log file
> without a base file, since there are no records to instantiate yet. So we
> should defer the instantiation: either to the end of the first commit, or to
> when the data table has at least one completed commit.
> For an already existing table, this is not an issue: if there are valid
> records, we will dynamically determine the number of file groups.





[GitHub] [hudi] nsivabalan opened a new pull request, #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT

2023-06-26 Thread via GitHub


nsivabalan opened a new pull request, #9057:
URL: https://github.com/apache/hudi/pull/9057

   ### Change Logs
   
   [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT. 
   
   For a fresh table, when both FILES and RLI are enabled, we use default values for the number of file groups, i.e. 10 for RLI. This also creates a log file without a base file, since there are no records to instantiate yet. So we should defer the instantiation: either to the end of the first commit, or to when the data table has at least one completed commit.
   
   For an already existing table, this is not an issue: if there are valid records, we will dynamically determine the number of file groups.
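   A hedged sketch of the deferral condition described above, using the standard timeline APIs (the actual gating in this PR may differ):

   ```java
   import org.apache.hudi.common.table.HoodieTableMetaClient;

   public class RliBootstrapGate {
     // Defer RLI bootstrap until the data table has at least one completed
     // commit, so the file-group count can be sized from real record counts.
     static boolean canBootstrapRli(HoodieTableMetaClient dataMetaClient) {
       return dataMetaClient.getActiveTimeline()
           .getCommitsTimeline()
           .filterCompletedInstants()
           .countInstants() >= 1;
     }
   }
   ```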
   
   ### Impact
   
   Deferring instantiation of RLI for a fresh table until there is at least one completed commit in the DT.
   
   ### Risk level (write none, low medium or high below)
   
   low.
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Closed] (HUDI-5300) Optimize initial commit w/ metadata table

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5300.
-
Resolution: Fixed

> Optimize initial commit w/ metadata table
> -
>
> Key: HUDI-5300
> URL: https://issues.apache.org/jira/browse/HUDI-5300
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The initial commit w/ MDT could be huge, so we have an opportunity to
> optimize by leveraging bulk_insert instead of regular upsert.





[jira] [Updated] (HUDI-6446) Defer Initialization of MDT just at the end of first commit

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6446:
--
Epic Link: HUDI-466

> Defer Initialization of MDT just at the end of first commit 
> 
>
> Key: HUDI-6446
> URL: https://issues.apache.org/jira/browse/HUDI-6446
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> For a fresh table, when both FILES and RLI are enabled, we use default values
> for the number of file groups, i.e. 10 for RLI. This also creates a log file
> without a base file, since there are no records to instantiate yet. So we
> should defer the instantiation: either to the end of the first commit, or to
> when the data table has at least one completed commit.
> For an already existing table, this is not an issue: if there are valid
> records, we will dynamically determine the number of file groups.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6446) Defer Initialization of MDT just at the end of first commit

2023-06-26 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6446:
-

 Summary: Defer Initialization of MDT just at the end of first 
commit 
 Key: HUDI-6446
 URL: https://issues.apache.org/jira/browse/HUDI-6446
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


For a fresh table, when both FILES and RLI are enabled, we use the default 
value for the number of file groups, i.e. 10 for RLI. This also creates a log 
file but does not create a base file, since there are no records to instantiate 
as such. So, we should defer the instantiation to later: either at the end of 
the first commit, or when the data table has at least 1 completed commit. 

For an already existing table, this is not an issue since, if there are valid 
records, we will dynamically determine the number of file groups. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5451) Ensure switching "001" and "002" suffix for compaction and cleaning in MDT is backwards compatible

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5451.
-
Resolution: Invalid

> Ensure switching "001" and "002" suffix for compaction and cleaning in MDT is 
> backwards compatible 
> ---
>
> Key: HUDI-5451
> URL: https://issues.apache.org/jira/browse/HUDI-5451
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>
> as per master, we suffix "001" for compaction and "002" for cleaning for 
> MDT. 
> But w/ record level index support, we are changing that: we are setting "001" 
> for new partition initialization, "002" for compaction and "003" for cleaning.
> For newer tables it's not an issue, but for an existing table, we need to 
> ensure it's backwards compatible. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #8978:
URL: https://github.com/apache/hudi/pull/8978#discussion_r1242926341


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/action/commit/FlinkDeletePreppedCommitActionExecutor.java:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.io.HoodieWriteHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.HoodieWriteMetadata;
+
+import java.util.List;
+
+/**
+ * Flink upsert prepped commit action executor.
+ */
+public class FlinkDeletePreppedCommitActionExecutor<T> extends 
BaseFlinkCommitActionExecutor<T> {
+
+  private final List<HoodieRecord<T>> preppedRecords;
+
+  public FlinkDeletePreppedCommitActionExecutor(HoodieEngineContext context,

Review Comment:
   Can you file a ticket for adding tests for delete prepped for Flink? 
   For Spark, let's add tests in this patch itself. 



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkMergeOnReadTable.java:
##
@@ -105,6 +106,11 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> 
delete(HoodieEngineContext c
 return new 
SparkDeleteDeltaCommitActionExecutor<>((HoodieSparkEngineContext) context, 
config, this, instantTime, keys).execute();
   }
 
+  @Override
+  public HoodieWriteMetadata<HoodieData<WriteStatus>> 
deletePrepped(HoodieEngineContext context, String instantTime, 
HoodieData<HoodieRecord<T>> preppedRecords) {

Review Comment:
   my bad. thanks



##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java:
##
@@ -41,13 +42,30 @@ public Option<Pair<HoodieRecord, Schema>> 
merge(HoodieRecord older, Schema oldSc
 ValidationUtils.checkArgument(older.getRecordType() == 
HoodieRecordType.SPARK);
 ValidationUtils.checkArgument(newer.getRecordType() == 
HoodieRecordType.SPARK);
 
-if (newer.getData() == null) {
-  // Delete record
-  return Option.empty();
+if (newer instanceof HoodieSparkRecord) {
+  HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer;
+  if (newSparkRecord.isDeleted()) {
+// Delete record
+return Option.empty();
+  }
+} else {
+  if (newer.getData() == null) {

Review Comment:
   we need to understand what's going on in that test



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java:
##
@@ -247,6 +247,15 @@ public JavaRDD<WriteStatus> delete(JavaRDD<HoodieKey> 
keys, String instantTime)
 return postWrite(resultRDD, instantTime, table);
   }
 
+  @Override
+  public JavaRDD<WriteStatus> deletePrepped(JavaRDD<HoodieRecord<T>> 
preppedRecord, String instantTime) {

Review Comment:
   we might need to add tests for this



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -349,9 +366,9 @@ object HoodieSparkSqlWriter {
 // Remove meta columns from writerSchema if isPrepped is true.
 val isPrepped = 
hoodieConfig.getBooleanOrDefault(DATASOURCE_WRITE_PREPPED_KEY, false)
 val processedDataSchema = if (isPrepped) {
-  HoodieAvroUtils.removeMetadataFields(writerSchema);
+  HoodieAvroUtils.removeMetadataFields(writerSchema)

Review Comment:
   guess this has to be 
   ```
   HoodieAvroUtils.removeMetadataFields(dataFileSchema) 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.

2023-06-26 Thread via GitHub


hudi-bot commented on PR #8609:
URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608387197

   
   ## CI report:
   
   * e14bd41edf6cc961d77087eea67f755f23590834 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] neerajpadarthi commented on issue #9050: [SUPPORT] Hudi Metadata BloomIndex stats failed (Failed to get the bloom filter)

2023-06-26 Thread via GitHub


neerajpadarthi commented on issue #9050:
URL: https://github.com/apache/hudi/issues/9050#issuecomment-1608383218

   Hey @ad1happy2go, thanks for checking. I have tested using 0.12v; it worked 
when the 1st and corresponding commits used 0.12v. 
   
   But the ingestion failed when performing an upsert using 0.12v on the 0.11v 
dataset (the initial dump was loaded using 0.11v). Is this an expected scenario? 
And can you also please let me know the process of migrating the datasets from 
0.11v to 0.12v. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] dineshbganesan closed issue #9024: Clustering is not picking all partitions

2023-06-26 Thread via GitHub


dineshbganesan closed issue #9024: Clustering is not picking all partitions
URL: https://github.com/apache/hudi/issues/9024


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] prashantwason commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.

2023-06-26 Thread via GitHub


prashantwason commented on PR #8609:
URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608381677

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on pull request #9056: [DOC] Add parquet blooms documentation

2023-06-26 Thread via GitHub


kazdy commented on PR #9056:
URL: https://github.com/apache/hudi/pull/9056#issuecomment-1608284752

   @parisni I think you need to add it to the "current" docs version as well, if 
you want to have it copied over to the 0.14 docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] prashantwason commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis

2023-06-26 Thread via GitHub


prashantwason commented on code in PR #9037:
URL: https://github.com/apache/hudi/pull/9037#discussion_r1242686769


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -195,11 +193,6 @@ protected <T> ClosableIterator<HoodieRecord<T>> 
lookupRecords(List<String> keys,
 blockContentLoc.getContentPositionInLogFile(),
 blockContentLoc.getBlockSize());
 
-// HFile read will be efficient if keys are sorted, since on storage 
records are sorted by key.

Review Comment:
   Removing this means that if there is any code path (existing or introduced 
tomorrow) that does not sort the keys, then we may have misses from the MDT. 
This could lead to data quality issues. 
   
   If we do not want the overhead of re-sorting an already sorted array (how 
much is the overhead?), then we at least need to add some checks here that the 
current key is greater than the previous key in getRecordsByKeysIterator 
and getRecordsByKeyPrefixIterator.
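   
   For illustration, a minimal sketch of such an ascending-key guard (the class 
and method names are mine, not the PR's):
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public final class SortedKeysGuard {
     // Throws if the key list handed to the HFile lookup is not strictly ascending.
     public static void validateAscending(List<String> keys) {
       for (int i = 1; i < keys.size(); i++) {
         if (keys.get(i).compareTo(keys.get(i - 1)) <= 0) {
           throw new IllegalArgumentException("Lookup keys must be sorted: '"
               + keys.get(i) + "' follows '" + keys.get(i - 1) + "'");
         }
       }
     }
   
     public static void main(String[] args) {
       validateAscending(Arrays.asList("a", "b", "c")); // ok
       validateAscending(Arrays.asList("b", "a"));      // throws
     }
   }
   ```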



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6445:
--
Description: 
CI has been unstable for the past few weeks. we need to triage them and fix it.

 

 

UT-spark datasource module times out after 3 hours. 

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]
 * Looks like the top 10 tests were taking 30 to 40 secs each and are now 
taking 40 to 50 secs or more, hence reaching the 3-hour limit

{code:java}
2023-06-20T05:03:58.6566739Z 52.124 org.apache.hudi.functional.TestIncrementalReadWithFullTableScan testFailEarlyForIncrViewQueryForNonExistingFiles{HoodieTableType}[2]
2023-06-20T05:03:58.6567324Z 49.446 org.apache.hudi.functional.TestIncrementalReadWithFullTableScan testFailEarlyForIncrViewQueryForNonExistingFiles{HoodieTableType}[1]
2023-06-20T05:03:58.6568005Z 48.659 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[1]
2023-06-20T05:03:58.6568471Z 47.799 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[3]
2023-06-20T05:03:58.6569093Z 47.586 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[2]
2023-06-20T05:03:58.6569503Z 41.208 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[2]
2023-06-20T05:03:58.6570090Z 41.034 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[4]
2023-06-20T05:03:58.6570501Z 40.225 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[3]
2023-06-20T05:03:58.6571231Z 39.853 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[1]
2023-06-20T05:03:58.6574224Z 39.357 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[1]
2023-06-20T05:03:58.6575261Z 38.995 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[3]
2023-06-20T05:03:58.6575765Z 38.846 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[2]
2023-06-20T05:03:58.6576470Z 35.404 org.apache.hudi.functional.TestMORDataSourceWithBucketIndex testCountWithBucketIndex
{code}
 

TestHoodieDeltaStreamer.testUpsertsMORContinuousMode

and testAsyncClusteringServiceWithCompaction

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]

 

TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]

 

TestWriteMergeOnRead.testUpsert

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35]

 

TestWriteMergeOnReadWithCompact.testUpsert

TestWriteCopyOnWrite.testSubtaskFails

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30]

 

 

 

  was:
CI has been unstable for the past few weeks. we need to triage them and fix it.

 

 

UT-spark datasource module times out after 3 hours. 

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]

 

TestHoodieDeltaStreamer.testUpsertsMORContinuousMode

and testAsyncClusteringServiceWithCompaction

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]

 

TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]

 

TestWriteMergeOnRead.testUpsert

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35]

 

TestWriteMergeOnReadWithCompact.testUpsert

TestWriteCopyOnWrite.testSubtaskFails

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30]

 

 

 


> Fix CI stability Jun 26, 2023
> -
>
> Key: HUDI-6445
> URL: https://issues.apache.org/jira/browse/HUDI-6445
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>
> CI has been unstable for the past few weeks. we need to triage them and fix 
> it.
>  
>  
> UT-spark datasource module times out after 3 hours. 
> 

[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


guanziyue commented on PR #9052:
URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607961507

   > @guanziyue Can you take a look at this PR? The background is that when 
bucket index is used with the Spark engine, the exception happens with very 
high odds. Is there any good idea how we can strengthen the usability?
   
   Thanks Danny. Got more info from the author side; it seems to be the same 
problem, which should be fixed by a previous PR. May wait for further feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


guanziyue commented on PR #9052:
URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607958298

   > > May I know if this still occurs after 
[HUDI-6401](https://issues.apache.org/jira/browse/HUDI-6401) is merged? And if 
so, could you also share the stacktrace including the HoodieWriteHandle code path?
   > 
   > Yes, using the master branch; a fix like the current PR can fix it. The 
error code path is like the stack above.
   
   As we discussed offline, could you please kindly have a try of 
[HUDI-6401](https://issues.apache.org/jira/browse/HUDI-6401)? Looking forward 
to your feedback!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6445:
--
Description: 
CI has been unstable for the past few weeks. we need to triage them and fix it.

 

 

UT-spark datasource module times out after 3 hours. 

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]

 

TestHoodieDeltaStreamer.testUpsertsMORContinuousMode

and testAsyncClusteringServiceWithCompaction

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]

 

TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]

 

TestWriteMergeOnRead.testUpsert

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35]

 

TestWriteMergeOnReadWithCompact.testUpsert

TestWriteCopyOnWrite.testSubtaskFails

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30]

 

 

 

  was:
CI has been unstable for the past few weeks. we need to triage them and fix it.

 

 

UT-spark datasource module times out after 3 hours. 

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]

 

TestHoodieDeltaStreamer.testUpsertsMORContinuousMode

and testAsyncClusteringServiceWithCompaction

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]

 

TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]

 

 


> Fix CI stability Jun 26, 2023
> -
>
> Key: HUDI-6445
> URL: https://issues.apache.org/jira/browse/HUDI-6445
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>
> CI has been unstable for the past few weeks. we need to triage them and fix 
> it.
>  
>  
> UT-spark datasource module times out after 3 hours. 
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]
>  
> TestHoodieDeltaStreamer.testUpsertsMORContinuousMode
> and testAsyncClusteringServiceWithCompaction
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]
>  
> TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]
>  
> TestWriteMergeOnRead.testUpsert
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35]
>  
> TestWriteMergeOnReadWithCompact.testUpsert
> TestWriteCopyOnWrite.testSubtaskFails
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6445:
--
Description: 
CI has been unstable for the past few weeks. we need to triage them and fix it.

 

 

UT-spark datasource module times out after 3 hours. 

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]

 

TestHoodieDeltaStreamer.testUpsertsMORContinuousMode

and testAsyncClusteringServiceWithCompaction

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]

 

TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts

[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]

 

 

  was:CI has been unstable for the past few weeks. we need to triage them and 
fix it


> Fix CI stability Jun 26, 2023
> -
>
> Key: HUDI-6445
> URL: https://issues.apache.org/jira/browse/HUDI-6445
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>
> CI has been unstable for the past few weeks. we need to triage them and fix 
> it.
>  
>  
> UT-spark datasource module times out after 3 hours. 
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]
>  
> TestHoodieDeltaStreamer.testUpsertsMORContinuousMode
> and testAsyncClusteringServiceWithCompaction
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]
>  
> TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6445:
--
Epic Link: HUDI-4302

> Fix CI stability Jun 26, 2023
> -
>
> Key: HUDI-6445
> URL: https://issues.apache.org/jira/browse/HUDI-6445
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>
> CI has been unstable for the past few weeks. we need to triage them and fix it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6445) Fix CI stability Jun 26, 2023

2023-06-26 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6445:
-

 Summary: Fix CI stability Jun 26, 2023
 Key: HUDI-6445
 URL: https://issues.apache.org/jira/browse/HUDI-6445
 Project: Apache Hudi
  Issue Type: Test
  Components: tests-ci
Reporter: sivabalan narayanan


CI has been unstable for the past few weeks. we need to triage them and fix it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1242484847


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .sinceVersion("0.10.0")
   .withDocumentation("File Id Prefix provider class, that implements 
`org.apache.hudi.fileid.FileIdPrefixProvider`");
 
+  public static final ConfigProperty<String> ENFORCE_COMPLETION_MARKER_CHECKS 
= ConfigProperty
+  .key("hoodie.markers.enforce.completion.checks")
+  .defaultValue("false")
+  .sinceVersion("0.10.0")
+  .withDocumentation("Prevents the creation of duplicate data files, when 
multiple spark tasks are racing to "
+  + "create data files and a completed data file is already present");
+
+  public static final ConfigProperty<String> ENFORCE_FINALIZE_WRITE_CHECK = 
ConfigProperty
+  .key("hoodie.markers.enforce.finalize.write.check")
+  .defaultValue("false")
+  .sinceVersion("0.10.0")

Review Comment:
   will do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1242484399


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -512,6 +512,7 @@ public List<WriteStatus> close() {
 status.getStat().setFileSizeInBytes(logFileSize);
   }
 
+  createCompletedMarkerFile(partitionPath, baseInstantTime);

Review Comment:
   Will update the diff after adding this check. (We have this enabled by 
default; it makes sense to wrap it up with the flag for OSS.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1242482645


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/TimelineServerBasedWriteMarkers.java:
##
@@ -132,6 +153,25 @@ public Set<String> allMarkerFilePaths() {
 }
   }
 
+  @Override
+  public void createMarkerDir() throws HoodieIOException {
+HoodieTimer timer = new HoodieTimer().startTimer();
+Map<String, String> paramsMap = new HashMap<>();
+paramsMap.put(MARKER_DIR_PATH_PARAM, markerDirPath.toString());

Review Comment:
   Currently, the timeline-server-based markers are designed using this 
mechanism, mainly used for cloud-based solutions. @nsivabalan @yihua can add 
details. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on PR #9035:
URL: https://github.com/apache/hudi/pull/9035#issuecomment-1607857303

   > Thanks for the contribution @nbalajee , In general I'm confused why we 
need two marker files for each base file, before the patch, we have in-progress 
marker file and write status real paths, we can diff out the corrupt/retry 
files by comparing the in-progress marker file handles and the paths recorded 
in writestatus.
   > 
   > And we also have some instant completion check in HoodieFileSystemView, to 
ignore the files/file blocks that are still pending, so why the reader view 
could read data sets that are not intended to be exposed?
   
   Thanks for the review, @dannyhchen and @nsivabalan.
   
   The following diagram summarizes the issue: 
   (a) When a batch of records given to an executor for writing spills over into 
multiple data files (split into multiple parts due to file size limits: 
f1-0_w1_c1.parquet, f1-1_w1_c1.parquet, etc.).
   (b) A Spark stage is retried, so all of its tasks are retried (while some 
tasks from previous attempts could still be ongoing); this mainly happens with 
a Spark FetchFailed exception.
   
   ![Screenshot 2023-06-25 at 9 15 35 
PM](https://github.com/apache/hudi/assets/47542891/7121d7e6-e624-4743-ad00-004fde3e8344)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1242478376


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, 
String fileName) {
*
* @param partitionPath Partition path
*/
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-.create(partitionPath, dataFileName, getIOType(), config, fileId, 
hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String 
dataFileName, String markerInstantTime) {
+WriteMarkers writeMarkers = 
WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+if (!writeMarkers.doesMarkerDirExist()) {

Review Comment:
   If we allow the markerDir to be created on an as-needed basis, a stray 
executor starting to write to a file would create the directory after the 
finalize write and end up leaving a duplicate file.  
   
   By creating the markerDir at the time of startCommit() and deleting the 
directory at/after finalizeWrite(), we ensure that executors can't start a 
new write operation, or successfully close an on-going write operation, if the 
markerDir is missing (deleted by finalizeWrite).
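   
   A toy model of this fencing (an illustrative sketch; the class and method 
names are mine, not Hudi's API): the marker root is created at startCommit() 
and deleted at finalizeWrite(), so any straggler that checks for it afterwards 
refuses to write.
   
   ```java
   import java.util.HashSet;
   import java.util.Set;
   
   public final class MarkerDirFence {
     private final Set<String> markerDirs = new HashSet<>();
   
     public void startCommit(String instant) {
       markerDirs.add(instant); // marker root created up front
     }
   
     public void writeDataFile(String instant, String fileName) {
       if (!markerDirs.contains(instant)) {
         // finalizeWrite already ran: stray executors are fenced out here.
         throw new IllegalStateException("Marker root absent for " + instant);
       }
       System.out.println("wrote " + fileName + " under instant " + instant);
     }
   
     public void finalizeWrite(String instant) {
       // reconcile data files against markers, then drop the marker root
       markerDirs.remove(instant);
     }
   
     public static void main(String[] args) {
       MarkerDirFence fence = new MarkerDirFence();
       fence.startCommit("001");
       fence.writeDataFile("001", "f1-0_w1_c1.parquet");
       fence.finalizeWrite("001");
       fence.writeDataFile("001", "f1-1_w1_c1.parquet"); // throws: straggler rejected
     }
   }
   ```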



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, 
String fileName) {
*
* @param partitionPath Partition path
*/
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-.create(partitionPath, dataFileName, getIOType(), config, fileId, 
hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String 
dataFileName, String markerInstantTime) {
+WriteMarkers writeMarkers = 
WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+if (!writeMarkers.doesMarkerDirExist()) {
+  throw new HoodieIOException(String.format("Marker root directory absent 
: %s/%s (%s)",
+  partitionPath, dataFileName, markerInstantTime));
+}
+if (config.enforceFinalizeWriteCheck()
+&& writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", 
"FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) {
+  throw new HoodieCorruptedDataException("Reconciliation for instant " + 
instantTime + " is completed, job is trying to re-write the data files.");
+}
+if (config.enforceCompletionMarkerCheck()
+&& 
writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath(partitionPath, 
fileId, markerInstantTime, getIOType()))) {
+  throw new HoodieIOException("Completed marker file exists for : " + 
dataFileName + " (" + instantTime + ")");
+}
+writeMarkers.create(partitionPath, dataFileName, getIOType());
+  }
+
+  // visible for testing
+  public void createCompletedMarkerFile(String partition, String 
markerInstantTime) throws IOException {
+try {
+  WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, 
instantTime)
+  .createCompletionMarker(partition, fileId, markerInstantTime, 
getIOType(), true);
+} catch (Exception e) {
+  // Clean up the data file, if the marker is already present or marker 
directories don't exist.
+  Path partitionPath = 
FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(), partition);

Review Comment:
   After the finalizeWrite and reconciling the files, we delete the marker 
directory. If a stray executor were to complete the write operation and close 
the file after the reconcile step, it would find the marker directory missing 
and would clean up the data file it created.
   
   ![Screenshot 2023-06-25 at 9 15 35 
PM](https://github.com/apache/hudi/assets/47542891/f84e70f9-5f17-4454-8ff1-608c59056ef3)
   In the example, executor C is trying to close the file after the 
finalizeWrite operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

2023-06-26 Thread via GitHub


nbalajee commented on code in PR #9035:
URL: https://github.com/apache/hudi/pull/9035#discussion_r1242478130


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -901,6 +901,9 @@ private void startCommit(String instantTime, String 
actionType, HoodieTableMetaC
   metaClient.getActiveTimeline().createNewInstant(new 
HoodieInstant(HoodieInstant.State.REQUESTED, actionType,
   instantTime));
 }
+
+// populate marker directory for the commit.
+WriteMarkersFactory.get(config.getMarkersType(), createTable(config, 
hadoopConf), instantTime).createMarkerDir();

Review Comment:
   That is the current behavior: the doesMarkerDirExist() check ensures that an 
executor can't start/complete the write operation after finalizeWrite().



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .sinceVersion("0.10.0")
   .withDocumentation("File Id Prefix provider class, that implements 
`org.apache.hudi.fileid.FileIdPrefixProvider`");
 
+  public static final ConfigProperty<String> ENFORCE_COMPLETION_MARKER_CHECKS 
= ConfigProperty
+  .key("hoodie.markers.enforce.completion.checks")
+  .defaultValue("false")
+  .sinceVersion("0.10.0")
+  .withDocumentation("Prevents the creation of duplicate data files, when 
multiple spark tasks are racing to "
+  + "create data files and a completed data file is already present");
+
+  public static final ConfigProperty<String> ENFORCE_FINALIZE_WRITE_CHECK = 
ConfigProperty
+  .key("hoodie.markers.enforce.finalize.write.check")
+  .defaultValue("false")
+  .sinceVersion("0.10.0")
+  .withDocumentation("When WriteStatus obj is lost due to engine related 
failures, then recomputing would involve "
+  + "re-writing all the data files. When this check is enabled it 
would block the rewrite from happening.");

Review Comment:
   I will update the doc. 
   
   Context: this check was added to address the following scenario:
   (1) As part of the insert/upsert operation, a set of files has been created 
(p1/f1_w1_c1.parquet, p2/f2_w2_c1.parquet, corresponding to commit c1).
   (2) finalizeWrite() successfully purged files that were created but are not 
part of the writeStatus.
   (3) As part of completing the commit c1, we will update the MDT with 
file-listing and RLI metadata. In order to update the record index, when 
iterating over the writeStatuses, if writeStatus RDD blocks are found to be 
missing, the execution engine (Spark) would re-trigger the write stage (to 
recreate the write statuses).
   
   The above flag is used to avoid rewriting all the files as part of a stage 
retry (which is more likely to fail during the second attempt). Instead, we 
fail the job so that the next write attempt can be made in a new job (after 
any required resource tuning). Not an issue for small/medium sized tables; we 
have seen this only on large tables (50B+ records).
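   
   For reference, a sketch of opting into these checks from writer properties; 
the two keys are the ones added in this PR's diff and default to "false":
   
   ```java
   import java.util.Properties;
   
   public class MarkerCheckProps {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Keys taken from this PR's HoodieWriteConfig additions.
       props.setProperty("hoodie.markers.enforce.completion.checks", "true");
       props.setProperty("hoodie.markers.enforce.finalize.write.check", "true");
       System.out.println(props);
       // Pass these into the HoodieWriteConfig builder / writer options as usual.
     }
   }
   ```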



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9054:
URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607783798

   
   ## CI report:
   
   * 3819ebe617f8338430fc1d1058f7e3938a6770e8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stp-pv commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT

2023-06-26 Thread via GitHub


stp-pv commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607785019

   We are seeing the problem with insert as well. Here is the simplest fix 
for the problem we are observing:
   
   ```diff
   diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
   index b42e6f8800..a0531772db 100644
   --- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
   +++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
   @@ -561,7 +561,7 @@ class HoodieCDCRDD(
  originTableSchema.structTypeSchema.zipWithIndex.foreach {
case (field, idx) =>
  if (field.dataType.isInstanceOf[StringType]) {
   -map(field.name) = record.getString(idx)
   +map(field.name) = 
Option(record.getUTF8String(idx)).map(_.toString).orNull
  } else {
map(field.name) = record.get(idx, field.dataType)
  }
   ```
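   
   For context, a small standalone sketch of why the original line NPEs; this 
encodes my reading of Spark's InternalRow contract (getString delegates to 
getUTF8String(idx).toString()), so treat that as an assumption:
   
   ```java
   import org.apache.spark.sql.catalyst.InternalRow;
   import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
   import org.apache.spark.unsafe.types.UTF8String;
   
   public class NullStringDemo {
     public static void main(String[] args) {
       // A row whose single string column is null.
       InternalRow record = new GenericInternalRow(new Object[] {null});
       // record.getString(0) would NPE, since it dereferences a null UTF8String.
       UTF8String raw = record.getUTF8String(0); // returns null instead of throwing
       String value = (raw == null) ? null : raw.toString();
       System.out.println(value); // prints "null"
     }
   }
   ```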


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607770606

   @zaza Thanks for the information. I am able to reproduce it with null values 
in one of the columns. Also confirmed this is only happening with bulk_insert. 
I will check with the master code once and then create a JIRA to fix it if it's 
still an issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis

2023-06-26 Thread via GitHub


nsivabalan commented on code in PR #9037:
URL: https://github.com/apache/hudi/pull/9037#discussion_r1242392462


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##
@@ -206,7 +198,12 @@ protected ClosableIterator<IndexedRecord> 
getIndexedRecordIterator(Schema reader
 }
 
 // TODO eval whether seeking scanner would be faster than pread
-HFileScanner scanner = getHFileScanner(reader, false);
+HFileScanner scanner = null;
+try {
+  scanner = getHFileScanner(reader, false, false);
+} catch (IOException e) {
+  throw new HoodieIOException("Instantiation HfileScanner failed for " + 
reader.getHFileInfo().toString());
+}

Review Comment:
   Every other method in the interface throws IOException except this method, 
so I left it as is. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] parisni opened a new pull request, #9056: [DOC] Add parquet blooms documentation

2023-06-26 Thread via GitHub


parisni opened a new pull request, #9056:
URL: https://github.com/apache/hudi/pull/9056

   ### Change Logs
   
   This adds documentation for the parquet bloom feature. I added it in 0.13.1, 
but this likely should be moved to 0.14. 
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5447) Add support for Record level index read from MDT

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5447.
-
Resolution: Fixed

> Add support for Record level index read from MDT
> 
>
> Key: HUDI-5447
> URL: https://issues.apache.org/jira/browse/HUDI-5447
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> introduce a new index which will leverage record level index partition in MDT 
> and assist in tag locations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5446) Add support to write record level index to MDT

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5446.
-
Resolution: Fixed

> Add support to write record level index to MDT
> --
>
> Key: HUDI-5446
> URL: https://issues.apache.org/jira/browse/HUDI-5446
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Add support to write our record level index partition to MDT



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5444) FileNotFound issue w/ metadata enabled

2023-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5444.
-
Resolution: Invalid

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>
> stacktrace
> {code:java}
> Caused by: java.io.FileNotFoundException: File not found: 
> gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
>         at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
>  {code}
>  
> 20221208133227028 (RB_C10)
> 20221208133227028001 MDT compaction
> 20221208132316380 (C10)
> 20221208133647230
> DT
> ║ 8   │ 20221202234515099 │ rollback │ COMPLETED │ Rolls back 2022120413756 │ 12-02 15:45:18 │ 12-02 15:45:18 │ 12-02 15:45:33 ║
> ║ 9   │ 20221208133227028 │ rollback │ COMPLETED │ Rolls back 20221208132316380 │ 12-08 05:32:33 │ 12-08 05:32:33 │ 12-08 05:32:44 ║
> ║ 10  │ 20221208133647230 │ rollback │ COMPLETED │ Rolls back 20221208133222583 │ 12-08 05:36:47 │ 12-08 05:36:48 │ 12-08 05:36:57 ║
> MDT timeline: 
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:32 
> 20221208133227028.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:32 
> 20221208133227028.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:32 20221208133227028.deltacommit
> -rw-r--r--@ 1 nsb  staff  1938 Dec  8 05:34 
> 20221208133227028001.compaction.requested
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 
> 20221208133227028001.compaction.inflight
> -rw-r--r--@ 1 nsb  staff  7556 Dec  8 05:34 20221208133227028001.commit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 
> 20221208132316380.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff  3049 Dec  8 05:34 
> 20221208132316380.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  8207 Dec  8 05:35 20221208132316380.deltacommit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:36 
> 20221208133647230.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:36 
> 20221208133647230.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:36 20221208133647230.deltacommit
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI

2023-06-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6443:
-
Labels: pull-request-available  (was: )

> Support insert_overwrite and insert_overwrite_table with RLI
> 
>
> Key: HUDI-6443
> URL: https://issues.apache.org/jira/browse/HUDI-6443
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan opened a new pull request, #9055: [HUDI-6443] Support insert_overwrite/table with record-level index

2023-06-26 Thread via GitHub


xushiyan opened a new pull request, #9055:
URL: https://github.com/apache/hudi/pull/9055

   ### Change Logs
   
   Support `insert_overwrite` and `insert_overwrite_table` with record-level 
index. The metadata records should be updated accordingly (see the sketch after 
the list below):
   
   - newly inserted records should be present in RLI
   - old records in the affected partitions should be removed from RLI
   - old records that happen to have the same record key as the new inserts 
won't be removed from RLI; their entries will be updated
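   
   A hedged write sketch; the RLI config key is my assumption from the 0.14 
metadata work, and `df`/`basePath` are placeholders assumed to be defined:
   
   ```java
   // df: the new batch; basePath: the existing table path (both assumed defined).
   df.write().format("hudi")
       .option("hoodie.table.name", "rli_overwrite_demo")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.partitionpath.field", "dt")
       .option("hoodie.datasource.write.operation", "insert_overwrite")
       .option("hoodie.metadata.record.index.enable", "true")
       .mode(SaveMode.Append)
       .save(basePath);
   // After the commit, RLI should map exactly the keys that survive in the
   // overwritten partitions, per the bullets above.
   ```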
   
   ### Impact
   
   RLI data integrity
   
   ### Risk level
   
   Medium
   
   - [ ] UT, FT and e2e testing.
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zaza commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT

2023-06-26 Thread via GitHub


zaza commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607596535

   Hi @ad1happy2go thanks for giving it a go. I followed your setup and it did 
work for me as well. After taking a deeper dive into our tables and what's in 
them we realized some of our records have _null values_ (with the field marked 
as nullable in the schema). It doesn't seem like any of the records from 
DataGenerator have empty fields, but would you mind trying your example with 
that in mind? 
   
   Once it's confirmed that null values are the culprit here, I will update the 
summary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5071: [HUDI-1881]: draft implementation for trigger based on data availability

2023-06-26 Thread via GitHub


hudi-bot commented on PR #5071:
URL: https://github.com/apache/hudi/pull/5071#issuecomment-1607567423

   
   ## CI report:
   
   * b7203e6d2d6f1e8d3121024faedfa2da1ccc0c71 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7088)
 
   * 518758403252fd03ca77eb8977dda217575efecc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI

2023-06-26 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-6443:
-
Priority: Blocker  (was: Major)

> Support insert_overwrite and insert_overwrite_table with RLI
> 
>
> Key: HUDI-6443
> URL: https://issues.apache.org/jira/browse/HUDI-6443
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI

2023-06-26 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-6443:


Assignee: Raymond Xu

> Support insert_overwrite and insert_overwrite_table with RLI
> 
>
> Key: HUDI-6443
> URL: https://issues.apache.org/jira/browse/HUDI-6443
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI

2023-06-26 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-6443:
-
Fix Version/s: 0.14.0

> Support insert_overwrite and insert_overwrite_table with RLI
> 
>
> Key: HUDI-6443
> URL: https://issues.apache.org/jira/browse/HUDI-6443
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6444) Support delete and delete_partition with RLI

2023-06-26 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-6444:


 Summary: Support delete and delete_partition with RLI
 Key: HUDI-6444
 URL: https://issues.apache.org/jira/browse/HUDI-6444
 Project: Apache Hudi
  Issue Type: Improvement
  Components: index, metadata
Reporter: Raymond Xu
 Fix For: 0.14.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI

2023-06-26 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-6443:


 Summary: Support insert_overwrite and insert_overwrite_table with 
RLI
 Key: HUDI-6443
 URL: https://issues.apache.org/jira/browse/HUDI-6443
 Project: Apache Hudi
  Issue Type: Improvement
  Components: index, metadata
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9049:
URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607556925

   
   ## CI report:
   
   * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6369) Spatial curve with sample strategy fails when 0 or 1 rows only is incoming

2023-06-26 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6369:
---

Assignee: nicolas paris

> Spatial curve with sample strategy fails when 0 or 1 rows only is incoming
> --
>
> Key: HUDI-6369
> URL: https://issues.apache.org/jira/browse/HUDI-6369
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/8934]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9053: [HUDI-6369] Fix spatial curve with sample strategy fails when 0 or 1 rows only is incoming

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9053:
URL: https://github.com/apache/hudi/pull/9053#issuecomment-1607440001

   
   ## CI report:
   
   * bf5569721d0a4d7019d1897c3af941031c3a3d30 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2023-06-26 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka updated HUDI-6438:

Priority: Critical  (was: Major)

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 0.14.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9054:
URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607351428

   
   ## CI report:
   
   * 3819ebe617f8338430fc1d1058f7e3938a6770e8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9054:
URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607338553

   
   ## CI report:
   
   * 3819ebe617f8338430fc1d1058f7e3938a6770e8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7343: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering

2023-06-26 Thread via GitHub


hudi-bot commented on PR #7343:
URL: https://github.com/apache/hudi/pull/7343#issuecomment-1607315899

   
   ## CI report:
   
   * 372cdaea808b0e17ef4868323a673dc3a15be1aa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18107)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6442) TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6442:
-
Labels: pull-request-available  (was: )

> TestPartialUpdateForMergeInto should support mor
> 
>
> Key: HUDI-6442
> URL: https://issues.apache.org/jira/browse/HUDI-6442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607307441

   @zaza Also, can you share your full table configuration? That might help me 
reproduce this error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 opened a new pull request, #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread via GitHub


xuzifu666 opened a new pull request, #9054:
URL: https://github.com/apache/hudi/pull/9054

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   none
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   none
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   TestPartialUpdateForMergeInto should support mor
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6442) TestPartialUpdateForMergeInto should support mor

2023-06-26 Thread xy (Jira)
xy created HUDI-6442:


 Summary: TestPartialUpdateForMergeInto should support mor
 Key: HUDI-6442
 URL: https://issues.apache.org/jira/browse/HUDI-6442
 Project: Apache Hudi
  Issue Type: Bug
  Components: tests-ci
Reporter: xy
Assignee: xy
 Fix For: 0.14.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #9032:
URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607304355

   @zaza I tried to reproduce the issue with Hudi 0.13.1, but I am seeing the 
expected behaviour for bulk insert: all CDC rows come through as inserts, which 
is what bulk insert should produce.
   
   Can you let me know in exactly what scenario you are getting the 
NullPointerException? Is it intermittent?
   
   Code I tried - 
   ```
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.QuickstartUtils._
   import org.apache.hudi.common.table.HoodieTableConfig
   import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode
   import org.apache.spark.sql.{Dataset, Row}
   import scala.collection.JavaConversions._
   
   val path = "file:///tmp/output/issue_9032_4"
   
   // Generate ten sample records and load them into a DataFrame.
   val dataGen = new DataGenerator
   val inserts = convertToStringList(dataGen.generateInserts(10))
   val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   
   val options = Map(
     "hoodie.table.name" -> "line_items",
     "hoodie.datasource.write.recordkey.field" -> "uuid",
     "hoodie.datasource.write.precombine.field" -> "ts",
     "hoodie.datasource.write.partitionpath.field" -> "partitionpath",
     "hoodie.parquet.max.file.size" -> "125829120",
     "hoodie.parquet.small.file.limit" -> "104857600",
     "hoodie.index.type" -> "BLOOM",
     "hoodie.bloom.index.use.metadata" -> "true",
     "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
     "hoodie.cleaner.commits.retained" -> "168",
     "hoodie.keep.min.commits" -> "173",
     "hoodie.keep.max.commits" -> "174"
   )
   
   // Bulk-insert into a CDC-enabled table.
   df.write.format("hudi")
     .options(options)
     .option(DataSourceWriteOptions.OPERATION.key, "bulk_insert")
     .option(HoodieTableConfig.NAME.key(), "line_items")
     .option(HoodieTableConfig.CDC_ENABLED.key, "true")
     .option(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE.key,
       HoodieCDCSupplementalLoggingMode.data_before_after.name())
     .mode("append")
     .save(path)
   
   // Read the CDC records back as an incremental streaming query.
   spark.readStream.format("hudi")
     .option("hoodie.datasource.query.incremental.format", "cdc")
     .option("hoodie.datasource.query.type", "incremental")
     .load(path)
     .writeStream
     .foreachBatch { (batch: Dataset[Row], _: Long) => batch.show(false) }
     .start
     .awaitTermination
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] joe-shad commented on pull request #5071: [HUDI-1881]: draft implementation for trigger based on data availability

2023-06-26 Thread via GitHub


joe-shad commented on PR #5071:
URL: https://github.com/apache/hudi/pull/5071#issuecomment-1607260369

   I'm waiting for this PR (or any possible solution to the continuous mode for 
MultiTableDeltaStreamer) as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9049:
URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607256682

   
   ## CI report:
   
   * 8960860b33c4b0a0016d8ee718525cb58f0a6959 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18099)
 
   * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9049:
URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607235976

   
   ## CI report:
   
   * 8960860b33c4b0a0016d8ee718525cb58f0a6959 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18099)
 
   * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8919: [SUPPORT] Hudi Stored Procedure show clustering fails on AWS Glue 4.0

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #8919:
URL: https://github.com/apache/hudi/issues/8919#issuecomment-1607224457

   @soumilshah1995 I am able to successfully run clustering with your code. The 
third block, for show clustering, fails as expected, since it tries to look up 
the table by name while we are passing a path.
   
   Can you clarify when exactly you are seeing this error: 
`java.util.NoSuchElementException: No value present in Option`? I didn't hit 
this error with Glue 4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan closed issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP URI with Callback | Passing Custom Headers ?

2023-06-26 Thread via GitHub


xushiyan closed issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP 
URI with Callback | Passing Custom Headers ?
URL: https://github.com/apache/hudi/issues/8834


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP URI with Callback | Passing Custom Headers ?

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #8834:
URL: https://github.com/apache/hudi/issues/8834#issuecomment-1607183534

   @soumilshah1995 Thanks for raising this. Hudi doesn't have a way to pass 
custom headers at the moment.
   
   Created a JIRA to track this improvement - 
https://issues.apache.org/jira/browse/HUDI-6441


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6441) Passing custom Headers with Hudi Callback URL

2023-06-26 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-6441:
---

 Summary: Passing custom Headers with Hudi Callback URL
 Key: HUDI-6441
 URL: https://issues.apache.org/jira/browse/HUDI-6441
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Aditya Goenka
 Fix For: 1.0.0


Hudi callback URLs don't support passing custom headers as of now. Implement a 
way to pass them and use them for the callback.

Github Issue - [https://github.com/apache/hudi/issues/8834]
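
A minimal sketch of how this could surface to users (assuming `df` is any 
DataFrame with the listed record-key/precombine fields; the custom-headers key 
below is hypothetical - only the other callback configs exist today):

{code:scala}
df.write.format("hudi")
  .option("hoodie.table.name", "callback_demo")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Existing HTTP commit-callback configs:
  .option("hoodie.write.commit.callback.on", "true")
  .option("hoodie.write.commit.callback.http.url", "https://example.com/hudi/commits")
  .option("hoodie.write.commit.callback.http.api.key", "my-api-key")
  // Hypothetical key proposed by this issue (not implemented yet), e.g.
  // semicolon-separated "Header-Name:value" pairs:
  .option("hoodie.write.commit.callback.http.custom.headers", "X-Tenant:team-a;X-Env:prod")
  .mode("append")
  .save("file:///tmp/callback_demo")
{code}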



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist

2023-06-26 Thread via GitHub


guanziyue commented on PR #9052:
URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607173922

   May I know if this still occurs after HUDI-6401 is merged? If so, could you 
also share the stack trace, including HoodieWriteHandle?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-06-26 Thread via GitHub


danny0405 commented on code in PR #9038:
URL: https://github.com/apache/hudi/pull/9038#discussion_r1241963611


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -267,6 +267,13 @@ public HoodieTimeline getCommitsTimeline() {
 return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, 
DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION));
   }
 
+  /**
+   * Get all instants (commits, delta commits, replace, compaction) that 
produce new data or merge file, in the active timeline.
+   */
+  public HoodieTimeline getCommitsAndMergesTimeline() {
+return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, 
DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION));
+  }

Review Comment:
   getCommitsAndMergesTimeline -> getCommitsAndCompactionTimeline
   
   Can we also add a test case for this incremental cleaning scenario, where 
the partition path got switched and the old partition files could not be 
cleaned?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8984: Offline compaction schedule failing with Error fetching partition paths from metadata table

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #8984:
URL: https://github.com/apache/hudi/issues/8984#issuecomment-1607160179

   @koochiswathiTR I don't think there is anything like that which unschedules 
the compaction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9053: [HUDI-6369] Fix spatial curve with sample strategy fails when 0 or 1 rows only is incoming

2023-06-26 Thread via GitHub


hudi-bot commented on PR #9053:
URL: https://github.com/apache/hudi/pull/9053#issuecomment-1607157136

   
   ## CI report:
   
   * bf5569721d0a4d7019d1897c3af941031c3a3d30 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan closed issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea"

2023-06-26 Thread via GitHub


xushiyan closed issue #8906: [SUPPORT] hudi upsert error: 
java.lang.NumberFormatException: For input string: "d880d4ea"
URL: https://github.com/apache/hudi/issues/8906


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea"

2023-06-26 Thread via GitHub


ad1happy2go commented on issue #8906:
URL: https://github.com/apache/hudi/issues/8906#issuecomment-1607152244

   @zyclove Looks like we have another issue tracking a similar problem - 
https://github.com/apache/hudi/issues/8986.
   
   Closing this one. Let us know in case of any concerns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


