[GitHub] [hudi] YannByron commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table with null values
YannByron commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1608757129

@stp-pv Nice fix. Could `map(field.name) = record.get(idx, field.dataType)` have the same problem? Can you also fake a case to test this, and then fix them together? Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
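The snippet under discussion copies each field out of a row by position, which NPEs when a column value is null. A minimal sketch of the null-safe pattern, using stand-in types rather than Spark's actual `InternalRow` API:

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeExtract {
    // Stand-in for a positional row accessor (illustrative, not Spark's API).
    interface Row {
        boolean isNullAt(int idx);
        Object get(int idx);
    }

    static Map<String, Object> extract(String[] fieldNames, Row record) {
        Map<String, Object> map = new HashMap<>();
        for (int idx = 0; idx < fieldNames.length; idx++) {
            // Store an explicit null instead of calling an accessor that assumes non-null.
            map.put(fieldNames[idx], record.isNullAt(idx) ? null : record.get(idx));
        }
        return map;
    }

    public static void main(String[] args) {
        Row r = new Row() {
            public boolean isNullAt(int idx) { return idx == 1; }
            public Object get(int idx) { return "v" + idx; }
        };
        Map<String, Object> m = extract(new String[]{"a", "b"}, r);
        System.out.println(m.get("a"));         // v0
        System.out.println(m.get("b") == null); // true
    }
}
```

A "fake case" for the test suite would then be a row whose `isNullAt` returns true for some index, asserting no exception is thrown.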
[GitHub] [hudi] zhuanshenbsj1 closed pull request #9047: [Hudi 6422] Solve the issues of compiling dependency on Hadoop 3.1.1
zhuanshenbsj1 closed pull request #9047: [Hudi 6422] Solve the issues of compiling dependency on Hadoop 3.1.1 URL: https://github.com/apache/hudi/pull/9047
[GitHub] [hudi] danny0405 commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
danny0405 commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1243097851

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
## @@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) {
    *
    * @param partitionPath Partition path
    */
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-    WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-        .create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) {
+    WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+    if (!writeMarkers.doesMarkerDirExist()) {
+      throw new HoodieIOException(String.format("Marker root directory absent : %s/%s (%s)",
+          partitionPath, dataFileName, markerInstantTime));
+    }
+    if (config.enforceFinalizeWriteCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", "FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) {
+      throw new HoodieCorruptedDataException("Reconciliation for instant " + instantTime + " is completed, job is trying to re-write the data files.");
+    }
+    if (config.enforceCompletionMarkerCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath(partitionPath, fileId, markerInstantTime, getIOType()))) {
+      throw new HoodieIOException("Completed marker file exists for : " + dataFileName + " (" + instantTime + ")");
+    }
+    writeMarkers.create(partitionPath, dataFileName, getIOType());
+  }
+
+  // visible for testing
+  public void createCompletedMarkerFile(String partition, String markerInstantTime) throws IOException {
+    try {
+      WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
+          .createCompletionMarker(partition, fileId, markerInstantTime, getIOType(), true);
+    } catch (Exception e) {
+      // Clean up the data file, if the marker is already present or marker directories don't exist.
+      Path partitionPath = FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(), partition);

Review Comment:
   The literals are hard to read clearly; I'll try to understand the workflow.
[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
hudi-bot commented on PR #9049: URL: https://github.com/apache/hudi/pull/9049#issuecomment-1608702766

## CI report:

* 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
* 04ef037a6fa3652fa98638c2442e4081c327dae9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18118)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
danny0405 commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1243093277

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
## @@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("File Id Prefix provider class, that implements `org.apache.hudi.fileid.FileIdPrefixProvider`");

+  public static final ConfigProperty ENFORCE_COMPLETION_MARKER_CHECKS = ConfigProperty
+      .key("hoodie.markers.enforce.completion.checks")
+      .defaultValue("false")
+      .sinceVersion("0.10.0")
+      .withDocumentation("Prevents the creation of duplicate data files, when multiple spark tasks are racing to "
+          + "create data files and a completed data file is already present");
+
+  public static final ConfigProperty ENFORCE_FINALIZE_WRITE_CHECK = ConfigProperty
+      .key("hoodie.markers.enforce.finalize.write.check")
+      .defaultValue("false")
+      .sinceVersion("0.10.0")
+      .withDocumentation("When WriteStatus obj is lost due to engine related failures, then recomputing would involve "
+          + "re-writing all the data files. When this check is enabled it would block the rewrite from happening.");

Review Comment:
   > if writeStatus RDD blocks are found to be missing, execution engine (spark) would re-trigger the write stage (to recreate the write statuses).

   This seems like a Spark-specific issue, but here we put the fix in the writer code, which could affect all the engines. May I know why the WriteStatus RDD blocks could be missing here? Can we persist them before committing to the MDT?
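The intent of the `hoodie.markers.enforce.completion.checks` guard in the patch above can be sketched independently of Hudi: before a task writes a data file, fail fast if a completion marker for that file already exists, so a re-triggered Spark stage cannot produce duplicates. The class and method names below are illustrative stand-ins, not Hudi's actual `WriteMarkers` API:

```java
import java.util.HashSet;
import java.util.Set;

public class CompletionMarkerGuard {
    // Stand-in for the marker directory: file ids with a completion marker.
    private final Set<String> completedMarkers = new HashSet<>();
    private final boolean enforceCompletionCheck;

    CompletionMarkerGuard(boolean enforceCompletionCheck) {
        this.enforceCompletionCheck = enforceCompletionCheck;
    }

    void markCompleted(String fileId) {
        completedMarkers.add(fileId);
    }

    // Called before creating a data file; throws if an earlier attempt already completed it.
    void beforeCreate(String fileId) {
        if (enforceCompletionCheck && completedMarkers.contains(fileId)) {
            throw new IllegalStateException("Completed marker file exists for: " + fileId);
        }
    }

    public static void main(String[] args) {
        CompletionMarkerGuard guard = new CompletionMarkerGuard(true);
        guard.beforeCreate("file-1");   // first attempt: allowed
        guard.markCompleted("file-1");  // write finished, completion marker created
        try {
            guard.beforeCreate("file-1"); // retried task: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This also illustrates the reviewer's concern: the guard lives in engine-agnostic writer code, even though the retry behavior that motivates it is Spark-specific.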
[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT
hudi-bot commented on PR #9057: URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608689486

## CI report:

* aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18116)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
hudi-bot commented on PR #9049: URL: https://github.com/apache/hudi/pull/9049#issuecomment-1608689384

## CI report:

* 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
* 04ef037a6fa3652fa98638c2442e4081c327dae9 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-5884) Support bulk_insert for insert_overwrite and insert_overwrite_table
[ https://issues.apache.org/jira/browse/HUDI-5884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An updated HUDI-5884:
-------------------------
    Fix Version/s: 0.14.0

> Support bulk_insert for insert_overwrite and insert_overwrite_table
> -------------------------------------------------------------------
>
>                 Key: HUDI-5884
>                 URL: https://issues.apache.org/jira/browse/HUDI-5884
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5692) SpillableMapBasePath should be lazily loaded
[ https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An updated HUDI-5692:
-------------------------
    Fix Version/s: 0.14.0
                       (was: 0.13.0)

> SpillableMapBasePath should be lazily loaded
> --------------------------------------------
>
>                 Key: HUDI-5692
>                 URL: https://issues.apache.org/jira/browse/HUDI-5692
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> If we use {{withInferFunction}} to set the default value of
> {{SPILLABLE_MAP_BASE_PATH}}, this default value will be set to
> {{HoodieWriteConfig}}'s {{properties}}, and will be serialized to all
> executors. This could introduce the issue that if the driver doesn't have the
> same temporary location as the executors (e.g. driver: /mnt/disk1,
> executor: /mnt/disk2), the executor would throw an error when creating the
> spillable map path (since the executor machine doesn't have the directory /mnt/disk1).
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Unable to create :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:119)
>   at org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMapNumEntries(ExternalSpillableMap.java:138)
>   at org.apache.hudi.io.HoodieMergeHandle.init(HoodieMergeHandle.java:268)
>   at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:129)
>   at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:121)
>   at org.apache.hudi.io.HoodieConcatHandle.<init>(HoodieConcatHandle.java:81)
>   at org.apache.hudi.io.HoodieMergeHandleFactory.create(HoodieMergeHandleFactory.java:60)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:386)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:363)
>   at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:330)
>   ... 29 more
> Caused by: java.io.IOException: Unable to create :/mnt/ssd/0/yarn/nm-local-dir/usercache/test/appcache/application_1673593627114_3970647/hudi-BITCASK-e3741235-6571-4112-8b20-271408148238
>   at org.apache.hudi.common.util.FileIOUtils.mkdir(FileIOUtils.java:70)
>   at org.apache.hudi.common.util.collection.DiskMap.<init>(DiskMap.java:55)
>   at org.apache.hudi.common.util.collection.BitCaskDiskMap.<init>(BitCaskDiskMap.java:98)
>   at org.apache.hudi.common.util.collection.ExternalSpillableMap.getDiskBasedMap(ExternalSpillableMap.java:116)
>   ... 38 more
> {code}
> A better solution is to calculate the temporary location when calling
> {{getSpillableMapBasePath}}
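The fix proposed in the issue above amounts to deferring path resolution from driver-side config construction to first use on the executor. A minimal sketch of that pattern with a memoized supplier (names here are illustrative, not Hudi's API):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class LazySpillablePath {
    // Wrap a supplier so the delegate runs at most effectively once and the
    // result is cached; resolution then happens on the machine that needs it.
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        AtomicReference<T> cache = new AtomicReference<>();
        return () -> {
            T v = cache.get();
            if (v == null) {
                cache.compareAndSet(null, delegate.get());
                v = cache.get();
            }
            return v;
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // On a real executor this would resolve a machine-local tmp dir,
        // instead of inheriting a path serialized from the driver.
        Supplier<String> basePath = memoize(() -> {
            calls.incrementAndGet();
            return System.getProperty("java.io.tmpdir");
        });
        basePath.get();
        basePath.get();
        System.out.println("resolved " + calls.get() + " time(s)"); // resolved 1 time(s)
    }
}
```

The key property is that serializing the *supplier's inputs* (config keys) is safe across machines, whereas serializing a *resolved path* is not.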
[GitHub] [hudi] xuzifu666 closed pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
xuzifu666 closed pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor URL: https://github.com/apache/hudi/pull/9054
[GitHub] [hudi] xuzifu666 commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
xuzifu666 commented on PR #9054: URL: https://github.com/apache/hudi/pull/9054#issuecomment-1608673153

Currently partial update is not supported in MERGE INTO, so I'm closing the PR.
[jira] [Updated] (HUDI-5692) SpillableMapBasePath should be lazily loaded
[ https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An updated HUDI-5692:
-------------------------
    Fix Version/s: 0.13.0

> SpillableMapBasePath should be lazily loaded
[GitHub] [hudi] xuzifu666 commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
xuzifu666 commented on PR #9052: URL: https://github.com/apache/hudi/pull/9052#issuecomment-1608665812

> > it seems to be the same problem which should be fixed by a previous PR. May wait for further feedback.
>
> Thanks so much for the help.

I'll close the PR first.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
nsivabalan commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1243081122

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
## @@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) {
    */
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) {
-    if (enabled && metadata != null) {
-      // Is this rollback of an instant that has been synced to the metadata table?
-      String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-      boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-      if (!wasSynced) {
-        // A compaction may have taken place on metadata table which would have included this instant being rolled back.
-        // Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458
-        Option<String> latestCompaction = metadata.getLatestCompactionTime();
-        if (latestCompaction.isPresent()) {
-          wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-        }
+    // The commit which is being rolled back on the dataset
+    final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0);
+    // Find the deltacommits since the last compaction
+    Option<Pair<HoodieTimeline, HoodieInstant>> deltaCommitsInfo =
+        CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+    if (!deltaCommitsInfo.isPresent()) {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime));
+      return;
+    }
+
+    // This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+    HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+    HoodieTimeline deltacommitsSinceCompaction = deltaCommitsInfo.get().getKey();
+
+    // The deltacommit that will be rolled back
+    HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+    // The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions
+    // are completed on the dataset. Hence, this case implies a rollback of completed commit which should actually be handled using restore.
+    if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+      final String compactionInstantTime = compactionInstant.getTimestamp();
+      if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, compactionInstantTime)) {
+        throw new HoodieMetadataException(String.format("Commit being rolled back %s is older than the latest compaction %s. "
+            + "There are %d deltacommits after this compaction: %s", commitInstantTime, compactionInstantTime,
+            deltacommitsSinceCompaction.countInstants(), deltacommitsSinceCompaction.getInstants()));
+      }
+    }
-      Map<MetadataPartitionType, HoodieData<HoodieRecord>> records =
-          HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
-              rollbackMetadata, getRecordsGenerationParams(), instantTime,
-              metadata.getSyncedInstantTime(), wasSynced);
-      commit(instantTime, records, false);
-      closeInternal();
+    if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {
+      LOG.info("Rolling back MDT deltacommit " + commitInstantTime);
+      if (!getWriteClient().rollback(commitInstantTime, instantTime)) {
+        throw new HoodieMetadataException("Failed to rollback deltacommit at " + commitInstantTime);
+      }
+    } else {
+      LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no corresponding deltacommits on MDT",
+          commitInstantTime, instantTime));
     }
+
+    // Rollback of MOR table may end up adding a new log file. So we need to check for added files and add them to MDT
+    processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(),
+        rollbackMetadata, getRecordsGenerationParams(), instantTime,
+        metadata.getSyncedInstantTime(), true), false);

Review Comment:
   Just wanted to double-confirm: in the list of valid instants we populate while reading the MDT using the log record reader, we do include rollback instants from the DT, right? How might this pan out if an async compaction from the DT is rolled back multiple times and then finally gets committed?
   ```
   public static Set<String> getValidInstantTimestamps(HoodieTableMetaClient dataMetaClient,
   ```
[GitHub] [hudi] xuzifu666 closed pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
xuzifu666 closed pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist URL: https://github.com/apache/hudi/pull/9052
[jira] [Resolved] (HUDI-5692) SpillableMapBasePath should be lazily loaded
[ https://issues.apache.org/jira/browse/HUDI-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An resolved HUDI-5692.
--------------------------

> SpillableMapBasePath should be lazily loaded
[GitHub] [hudi] danny0405 commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
danny0405 commented on PR #9052: URL: https://github.com/apache/hudi/pull/9052#issuecomment-1608659216

> it seems to be the same problem which should be fixed by a previous PR. May wait for further feedback.

Thanks so much for the help.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT
danny0405 commented on code in PR #9057: URL: https://github.com/apache/hudi/pull/9057#discussion_r1243067339 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -344,6 +344,13 @@ private boolean initializeFromFilesystem(String initializationTime, List
[GitHub] [hudi] danny0405 commented on a diff in pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
danny0405 commented on code in PR #9049: URL: https://github.com/apache/hudi/pull/9049#discussion_r1243054136

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClientIds.java:
## @@ -167,6 +167,7 @@ private void updateHeartbeat(Path heartbeatFilePath) throws HoodieHeartbeatExcep
       this.fs.create(heartbeatFilePath, true);
       outputStream.close();
     } catch (IOException io) {
+      LOG.error("Unable to generate heartbeat,heartbeatFilePath:{}", heartbeatFilePath, io);
       throw new HoodieHeartbeatException("Unable to generate heartbeat ", io);
     }

Review Comment:
   We can remove the log: `LOG.error("Unable to generate heartbeat,heartbeatFilePath:{}", heartbeatFilePath, io);`
[GitHub] [hudi] danny0405 commented on a diff in pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
danny0405 commented on code in PR #9049: URL: https://github.com/apache/hudi/pull/9049#discussion_r1243053950

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:
## @@ -262,6 +262,7 @@ private void updateHeartbeat(String instantTime) throws HoodieHeartbeatException
       heartbeat.setLastHeartbeatTime(newHeartbeatTime);
       heartbeat.setNumHeartbeats(heartbeat.getNumHeartbeats() + 1);
     } catch (IOException io) {
+      LOG.error("Unable to generate heartbeat,instant:{}", instantTime, io);
       throw new HoodieHeartbeatException("Unable to generate heartbeat ", io);
     }

Review Comment:
   We can remove the log: `LOG.error("Unable to generate heartbeat,instant:{}", instantTime, io);`
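The review suggestion above follows the common "log or throw, not both" guideline: the wrapped `IOException` already travels with the thrown exception, so the catch site can carry the useful context (instant or path) in the exception message rather than logging it separately and producing duplicate log lines. A minimal sketch of that pattern, using illustrative stand-in names rather than Hudi's actual classes:

```java
import java.io.IOException;

public class LogOrThrow {
    // Stand-in for HoodieHeartbeatException: a runtime wrapper that preserves the cause.
    static class HeartbeatException extends RuntimeException {
        HeartbeatException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Instead of LOG.error(...) followed by throw, put the context in the message;
    // whoever catches it (or the top-level handler) logs it once with the full cause chain.
    static void updateHeartbeat(String instantTime) {
        try {
            throw new IOException("disk full"); // stand-in for the failing fs.create(...)
        } catch (IOException io) {
            throw new HeartbeatException("Unable to generate heartbeat for instant " + instantTime, io);
        }
    }

    public static void main(String[] args) {
        try {
            updateHeartbeat("20230627095625000");
        } catch (HeartbeatException e) {
            System.out.println(e.getMessage());            // context survives in the message
            System.out.println(e.getCause().getMessage()); // so does the original cause
        }
    }
}
```

With this shape, removing the extra `LOG.error` loses no information, as the reviewer suggests.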
[jira] [Updated] (HUDI-5303) Allow users to control the concurrency to submit jobs in clustering
[ https://issues.apache.org/jira/browse/HUDI-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-5303:
-----------------------------
    Fix Version/s: 0.14.0

> Allow users to control the concurrency to submit jobs in clustering
> -------------------------------------------------------------------
>
>                 Key: HUDI-5303
>                 URL: https://issues.apache.org/jira/browse/HUDI-5303
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: clustering, spark
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Even if there are sufficient resources in the clustering job, some clustering
> groups sometimes could still wait to be triggered. We use the ForkJoinPool to
> submit these jobs, and it is difficult for clients to adjust this setting
> (--conf spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism);
> it could also affect other tasks using the ForkJoinPool. Instead, we
> introduce a new thread pool to control the job-submission parallelism for
> clustering.
[jira] [Closed] (HUDI-5303) Allow users to control the concurrency to submit jobs in clustering
[ https://issues.apache.org/jira/browse/HUDI-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-5303.
----------------------------
    Resolution: Fixed

Fixed via master branch: 8eafe17a6a276b1384d2e4b528fd0abdf190bd84

> Allow users to control the concurrency to submit jobs in clustering
[hudi] branch master updated: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 8eafe17a6a2  [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)

8eafe17a6a2 is described below

commit 8eafe17a6a276b1384d2e4b528fd0abdf190bd84
Author: Rex(Hui) An
AuthorDate: Tue Jun 27 09:56:25 2023 +0800

    [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering (#7343)
---
 .../apache/hudi/config/HoodieClusteringConfig.java |  9 +++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  4 ++
 .../MultipleSparkJobExecutionStrategy.java         | 66 +-
 3 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
index cafed2febc6..e9ff847a6f0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
@@ -156,6 +156,15 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .sinceVersion("0.9.0")
       .withDocumentation("Config to control frequency of async clustering");
 
+  public static final ConfigProperty<Integer> CLUSTERING_MAX_PARALLELISM = ConfigProperty
+      .key("hoodie.clustering.max.parallelism")
+      .defaultValue(15)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Maximum number of parallelism jobs submitted in clustering operation. "
+          + "If the resource is sufficient(Like Spark engine has enough idle executors), increasing this "
+          + "value will let the clustering job run faster, while it will give additional pressure to the "
+          + "execution engines to manage more concurrent running jobs.");
+
   public static final ConfigProperty PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST = ConfigProperty
       .key(CLUSTERING_STRATEGY_PARAM_PREFIX + "daybased.skipfromlatest.partitions")
       .defaultValue("0")

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index eba9728777f..7b672abf241 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -1634,6 +1634,10 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getString(HoodieClusteringConfig.PLAN_STRATEGY_CLASS_NAME);
   }
 
+  public int getClusteringMaxParallelism() {
+    return getInt(HoodieClusteringConfig.CLUSTERING_MAX_PARALLELISM);
+  }
+
   public ClusteringPlanPartitionFilterMode getClusteringPlanPartitionFilterMode() {
     String mode = getString(HoodieClusteringConfig.PLAN_PARTITION_FILTER_MODE_NAME);
     return ClusteringPlanPartitionFilterMode.valueOf(mode);

diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
index 540da42fd78..c6a1df9105e 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
@@ -36,6 +36,7 @@ import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
 import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.common.util.CustomizedThreadFactory;
 import org.apache.hudi.common.util.FutureUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
@@ -82,6 +83,8 @@ import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
@@ -105,30 +108,39 @@ public abstract class MultipleSparkJobExecutionStrategy
   public HoodieWriteMetadata> performClustering(final HoodieClusteringPlan clusteringPlan, final Schema schema, final String instantTime) {
     JavaSparkContext engineContext = HoodieSparkEngineContext.getSparkContext(getEngineContext());
     boolean shouldPreserveMetadata =
[GitHub] [hudi] danny0405 merged pull request #7343: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering
danny0405 merged PR #7343: URL: https://github.com/apache/hudi/pull/7343 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Coco0201 commented on issue #8371: [SUPPORT] Flink cant read metafield '_hoodie_commit_time'
Coco0201 commented on issue #8371: URL: https://github.com/apache/hudi/issues/8371#issuecomment-1608574627 > Did you declare the `_hoodie_commit_time` as a schema field in your table? I found that a comma in the DDL of my Flink table had been left out. So there is no problem reading metafields.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis
nsivabalan commented on code in PR #9037: URL: https://github.com/apache/hudi/pull/9037#discussion_r1243027563 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -195,11 +193,6 @@ protected ClosableIterator> lookupRecords(List keys, blockContentLoc.getContentPositionInLogFile(), blockContentLoc.getBlockSize()); -// HFile read will be efficient if keys are sorted, since on storage records are sorted by key. Review Comment: sure. we can fix that.
[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
hudi-bot commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608548374 ## CI report: * e14bd41edf6cc961d77087eea67f755f23590834 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT
hudi-bot commented on PR #9057: URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608518499 ## CI report: * aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18116) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
nsivabalan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1242967907 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. 
+ // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( Review Comment: got it ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. 
+ // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( + createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)), + Option.empty()); +} + } else { +return Pair.of(getTaggedRecord(incomingRecord, Option.empty()),
[GitHub] [hudi] hudi-bot commented on pull request #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT
hudi-bot commented on PR #9057: URL: https://github.com/apache/hudi/pull/9057#issuecomment-1608500681 ## CI report: * aea6f0bb6a55c8019f34cf9b328abef34f0a5f01 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
nsivabalan commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1242961967 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { */ @Override public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) { -if (enabled && metadata != null) { - // Is this rollback of an instant that has been synced to the metadata table? - String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0); - boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant)); - if (!wasSynced) { -// A compaction may have taken place on metadata table which would have included this instant being rolled back. -// Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458 -Option latestCompaction = metadata.getLatestCompactionTime(); -if (latestCompaction.isPresent()) { - wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get()); -} +// The commit which is being rolled back on the dataset +final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0); +// Find the deltacommits since the last compaction +Option> deltaCommitsInfo = + CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline()); +if (!deltaCommitsInfo.isPresent()) { + LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime)); + return; +} + +// This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction) +HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue(); +HoodieTimeline deltacommitsSinceCompaction = 
deltaCommitsInfo.get().getKey(); + +// The deltacommit that will be rolled back +HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime); + +// The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions +// are completed on the dataset. Hence, this case implies a rollback of completed commit which should actually be handled using restore. +if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) { + final String compactionInstantTime = compactionInstant.getTimestamp(); + if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, compactionInstantTime)) { +throw new HoodieMetadataException(String.format("Commit being rolled back %s is older than the latest compaction %s. " ++ "There are %d deltacommits after this compaction: %s", commitInstantTime, compactionInstantTime, +deltacommitsSinceCompaction.countInstants(), deltacommitsSinceCompaction.getInstants())); } +} - Map> records = - HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(), - rollbackMetadata, getRecordsGenerationParams(), instantTime, - metadata.getSyncedInstantTime(), wasSynced); - commit(instantTime, records, false); - closeInternal(); +if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) { + LOG.info("Rolling back MDT deltacommit " + commitInstantTime); + if (!getWriteClient().rollback(commitInstantTime, instantTime)) { +throw new HoodieMetadataException("Failed to rollback deltacommit at " + commitInstantTime); + } +} else { + LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no corresponding deltacommits on MDT", + commitInstantTime, instantTime)); } + +// Rollback of MOR table may end up adding a new log file. 
So we need to check for added files and add them to MDT +processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(), +rollbackMetadata, getRecordsGenerationParams(), instantTime, +metadata.getSyncedInstantTime(), true), false); Review Comment: I get it. for MOR data table, rollback will add a new log file in DT. And so we need this to track adding the new file. But can we optimize this so that this gets triggered only for MOR table or only when there are files to be added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:
[jira] [Updated] (HUDI-6446) Defer Initialization of MDT just at the end of first commit
[ https://issues.apache.org/jira/browse/HUDI-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6446: - Labels: pull-request-available (was: ) > Defer Initialization of MDT just at the end of first commit > > > Key: HUDI-6446 > URL: https://issues.apache.org/jira/browse/HUDI-6446 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > For a fresh table, when both FILES and RLI are enabled, we use default values > for the number of file groups, i.e., 10 for RLI, and this also creates a log file and does > not create a base file since there are no records to instantiate as such. So, > we should defer the instantiation to later, either to the end of the first commit > or to when the data table has at least 1 completed commit. > For an already existing table, this is not an issue since, if there are valid > records, we will dynamically determine the number of file groups.
[GitHub] [hudi] nsivabalan opened a new pull request, #9057: [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT
nsivabalan opened a new pull request, #9057: URL: https://github.com/apache/hudi/pull/9057 ### Change Logs [HUDI-6446] Fixing MDT commit time parsing and RLI instantiation with MDT. For a fresh table, when both FILES and RLI are enabled, we use default values for the number of file groups, i.e., 10 for RLI, and this also creates a log file and does not create a base file since there are no records to instantiate as such. So, we should defer the instantiation to later, either to the end of the first commit or to when the data table has at least 1 completed commit. For an already existing table, this is not an issue since, if there are valid records, we will dynamically determine the number of file groups. ### Impact Deferring instantiation of RLI for a fresh table to later, when we have at least 1 completed commit in the DT. ### Risk level (write none, low medium or high below) low. ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Closed] (HUDI-5300) Optimize initial commit w/ metadata table
[ https://issues.apache.org/jira/browse/HUDI-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-5300. - Resolution: Fixed > Optimize initial commit w/ metadata table > - > > Key: HUDI-5300 > URL: https://issues.apache.org/jira/browse/HUDI-5300 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > > The initial commit w/ MDT could be huge. So, we have an opportunity to optimize > by leveraging bulk_insert instead of regular upsert.
[jira] [Updated] (HUDI-6446) Defer Initialization of MDT just at the end of first commit
[ https://issues.apache.org/jira/browse/HUDI-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6446: -- Epic Link: HUDI-466 > Defer Initialization of MDT just at the end of first commit > > > Key: HUDI-6446 > URL: https://issues.apache.org/jira/browse/HUDI-6446 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Priority: Major > > For a fresh table, when both FILES and RLI are enabled, we use default values > for the number of file groups, i.e., 10 for RLI, and this also creates a log file and does > not create a base file since there are no records to instantiate as such. So, > we should defer the instantiation to later, either to the end of the first commit > or to when the data table has at least 1 completed commit. > For an already existing table, this is not an issue since, if there are valid > records, we will dynamically determine the number of file groups.
[jira] [Created] (HUDI-6446) Defer Initialization of MDT just at the end of first commit
sivabalan narayanan created HUDI-6446: - Summary: Defer Initialization of MDT just at the end of first commit Key: HUDI-6446 URL: https://issues.apache.org/jira/browse/HUDI-6446 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan For a fresh table, when both FILES and RLI are enabled, we use default values for the number of file groups, i.e., 10 for RLI, and this also creates a log file and does not create a base file since there are no records to instantiate as such. So, we should defer the instantiation to later, either to the end of the first commit or to when the data table has at least 1 completed commit. For an already existing table, this is not an issue since, if there are valid records, we will dynamically determine the number of file groups.
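The deferral rule described in this issue can be sketched as a sizing function: skip record-level index initialization until the data table has at least one completed commit, then size the file-group count from real record counts. The method name, the ceiling-division heuristic, and `recordsPerFileGroup` are illustrative assumptions, not the actual Hudi logic:

```java
import java.util.List;

public class RliInitSketch {
    // Default used for a fresh table when nothing else is known (per the issue, 10).
    static final int DEFAULT_RLI_FILE_GROUPS = 10;

    // Hypothetical sketch: return 0 (defer initialization) until the data
    // table has at least one completed commit; afterwards, derive the
    // file-group count dynamically from the record count.
    static int fileGroupsToInitialize(List<String> completedCommits,
                                      long recordCount,
                                      long recordsPerFileGroup) {
        if (completedCommits.isEmpty()) {
            return 0; // defer: nothing to size the index from yet
        }
        // Ceiling division, with a floor of one file group.
        return (int) Math.max(1, (recordCount + recordsPerFileGroup - 1) / recordsPerFileGroup);
    }

    public static void main(String[] args) {
        System.out.println(fileGroupsToInitialize(List.of(), 0, 100_000));           // 0 (deferred)
        System.out.println(fileGroupsToInitialize(List.of("c1"), 250_000, 100_000)); // 3
    }
}
```

Deferring this way avoids committing to the guessed default of 10 file groups (and the empty log file that comes with it) before any data exists.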
[jira] [Closed] (HUDI-5451) Ensure switching "001" and "002" suffix for compaction and cleaning in MDT is backwards compatible
[ https://issues.apache.org/jira/browse/HUDI-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-5451. - Resolution: Invalid > Ensure switching "001" and "002" suffix for compaction and cleaning in MDT is > backwards compatible > --- > > Key: HUDI-5451 > URL: https://issues.apache.org/jira/browse/HUDI-5451 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.14.0 > > > As per master, we suffix "001" for compaction and "002" for cleaning for > MDT. > But w/ record-level index support, we are changing that: we are setting "001" > for new partition initialization, "002" for compaction, and "003" for cleaning. > For newer tables it's not an issue, but for an existing table, we need to > ensure it's backwards compatible.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup
nsivabalan commented on code in PR #8978: URL: https://github.com/apache/hudi/pull/8978#discussion_r1242926341 ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/action/commit/FlinkDeletePreppedCommitActionExecutor.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table.action.commit; + +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.io.HoodieWriteHandle; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.action.HoodieWriteMetadata; + +import java.util.List; + +/** + * Flink upsert prepped commit action executor. + */ +public class FlinkDeletePreppedCommitActionExecutor extends BaseFlinkCommitActionExecutor { + + private final List> preppedRecords; + + public FlinkDeletePreppedCommitActionExecutor(HoodieEngineContext context, Review Comment: Can you file a ticket or adding tests for delete prepped for flink. for spark, lets add tests in this patch only. 
## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkMergeOnReadTable.java: ## @@ -105,6 +106,11 @@ public HoodieWriteMetadata> delete(HoodieEngineContext c return new SparkDeleteDeltaCommitActionExecutor<>((HoodieSparkEngineContext) context, config, this, instantTime, keys).execute(); } + @Override + public HoodieWriteMetadata> deletePrepped(HoodieEngineContext context, String instantTime, HoodieData> preppedRecords) { Review Comment: my bad. thanks ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java: ## @@ -41,13 +42,30 @@ public Option> merge(HoodieRecord older, Schema oldSc ValidationUtils.checkArgument(older.getRecordType() == HoodieRecordType.SPARK); ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecordType.SPARK); -if (newer.getData() == null) { - // Delete record - return Option.empty(); +if (newer instanceof HoodieSparkRecord) { + HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer; + if (newSparkRecord.isDeleted()) { +// Delete record +return Option.empty(); + } +} else { + if (newer.getData() == null) { Review Comment: we need to understand whats going on in that test ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java: ## @@ -247,6 +247,15 @@ public JavaRDD delete(JavaRDD keys, String instantTime) return postWrite(resultRDD, instantTime, table); } + @Override + public JavaRDD deletePrepped(JavaRDD> preppedRecord, String instantTime) { Review Comment: we might need to add tests for this ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -349,9 +366,9 @@ object HoodieSparkSqlWriter { // Remove meta columns from writerSchema if isPrepped is true. 
val isPrepped = hoodieConfig.getBooleanOrDefault(DATASOURCE_WRITE_PREPPED_KEY, false) val processedDataSchema = if (isPrepped) { - HoodieAvroUtils.removeMetadataFields(writerSchema); + HoodieAvroUtils.removeMetadataFields(writerSchema) Review Comment: guess this has to be ``` HoodieAvroUtils.removeMetadataFields(dataFileSchema) ```
[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
hudi-bot commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608387197 ## CI report: * e14bd41edf6cc961d77087eea67f755f23590834 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] neerajpadarthi commented on issue #9050: [SUPPORT] Hudi Metadata BloomIndex stats failed (Failed to get the bloom filter)
neerajpadarthi commented on issue #9050: URL: https://github.com/apache/hudi/issues/9050#issuecomment-1608383218 Hey @ad1happy2go, thanks for checking. I have tested using 0.12v; it worked when the 1st and corresponding commits used 0.12v. But the ingestion failed when performing an upsert using 0.12v on the 0.11v dataset (the initial dump was loaded using 0.11v). Is this an expected scenario? And can you also please let me know the process for migrating the datasets from 0.11v to 0.12v.
[GitHub] [hudi] dineshbganesan closed issue #9024: Clustering is not picking all partitions
dineshbganesan closed issue #9024: Clustering is not picking all partitions URL: https://github.com/apache/hudi/issues/9024
[GitHub] [hudi] prashantwason commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
prashantwason commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1608381677 @hudi-bot run azure
[GitHub] [hudi] kazdy commented on pull request #9056: [DOC] Add parquet blooms documentation
kazdy commented on PR #9056: URL: https://github.com/apache/hudi/pull/9056#issuecomment-1608284752 @parisni I think you need to add it to the "current" docs version as well, if you want to have it copied over to the 0.14 docs
[GitHub] [hudi] prashantwason commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis
prashantwason commented on code in PR #9037: URL: https://github.com/apache/hudi/pull/9037#discussion_r1242686769 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -195,11 +193,6 @@ protected ClosableIterator> lookupRecords(List keys, blockContentLoc.getContentPositionInLogFile(), blockContentLoc.getBlockSize()); -// HFile read will be efficient if keys are sorted, since on storage records are sorted by key. Review Comment: Removing this means that if there is any code path (existing or introduced tomorrow) that does not sort the keys then we may have misses from the MDT. This could lead to data quality issues. If we do not want to have the overhead of re-sorting a sorted array (how much is the overhead?) then we at least need to add some checks here that the current key is greater than the previous key in getRecordsByKeysIterator and getRecordsByKeyPrefixIterator.
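The check the reviewer asks for, verifying each key is strictly greater than the previous one instead of unconditionally re-sorting, can be sketched like this (a standalone illustration; `isStrictlySorted` is not an existing Hudi method):

```java
import java.util.List;

public class SortedKeyCheck {
    // Guard suggested in the review: HFile lookups assume the key list is
    // sorted (records on storage are sorted by key), so validate ascending
    // order up front rather than paying to re-sort an already-sorted list.
    static boolean isStrictlySorted(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1).compareTo(keys.get(i)) >= 0) {
                return false; // out of order or duplicate key
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStrictlySorted(List.of("a", "b", "c"))); // true
        System.out.println(isStrictlySorted(List.of("b", "a")));      // false
    }
}
```

An O(n) scan like this is cheap next to the lookup itself, and it fails fast on any future code path that forgets to sort its keys, turning a silent miss into an explicit error.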
[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023
[ https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6445: -- Description: CI has been unstable for the past few weeks. We need to triage the failures and fix them.

UT-spark datasource module times out after 3 hours. [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc]
* Looks like the top 10 tests were taking 30 to 40 secs and are now taking 40 to 50 secs or more, hence reaching the 3-hour limit:
{code:java}
2023-06-20T05:03:58.6566739Z 52.124 org.apache.hudi.functional.TestIncrementalReadWithFullTableScan testFailEarlyForIncrViewQueryForNonExistingFiles{HoodieTableType}[2]
2023-06-20T05:03:58.6567324Z 49.446 org.apache.hudi.functional.TestIncrementalReadWithFullTableScan testFailEarlyForIncrViewQueryForNonExistingFiles{HoodieTableType}[1]
2023-06-20T05:03:58.6568005Z 48.659 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[1]
2023-06-20T05:03:58.6568471Z 47.799 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[3]
2023-06-20T05:03:58.6569093Z 47.586 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testMORDataSourceWrite{HoodieCDCSupplementalLoggingMode}[2]
2023-06-20T05:03:58.6569503Z 41.208 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[2]
2023-06-20T05:03:58.6570090Z 41.034 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[4]
2023-06-20T05:03:58.6570501Z 40.225 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[3]
2023-06-20T05:03:58.6571231Z 39.853 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[1]
2023-06-20T05:03:58.6574224Z 39.357 org.apache.hudi.functional.TestMORDataSource testCount{HoodieRecordType, HoodieRecordType, String}[1]
2023-06-20T05:03:58.6575261Z 38.995 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[3]
2023-06-20T05:03:58.6575765Z 38.846 org.apache.hudi.functional.cdc.TestCDCDataFrameSuite testCOWDataSourceWrite{HoodieCDCSupplementalLoggingMode}[2]
2023-06-20T05:03:58.6576470Z 35.404 org.apache.hudi.functional.TestMORDataSourceWithBucketIndex testCountWithBucketIndex
{code}
TestHoodieDeltaStreamer.testUpsertsMORContinuousMode and testAsyncClusteringServiceWithCompaction [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19]
TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21]
TestWriteMergeOnRead.testUpsert [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35]
TestWriteMergeOnReadWithCompact.testUpsert
TestWriteCopyOnWrite.testSubtaskFails [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30]

was: CI has been unstable for the past few weeks. we need to triage them and fix it. UT-spark datasource module times out after 3 hours.
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] TestHoodieDeltaStreamer.testUpsertsMORContinuousMode and testAsyncClusteringServiceWithCompaction [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] TestWriteMergeOnRead.testUpsert [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35] TestWriteMergeOnReadWithCompact.testUpsert TestWriteCopyOnWrite.testSubtaskFails [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30] > Fix CI stability Jun 26, 2023 > - > > Key: HUDI-6445 > URL: https://issues.apache.org/jira/browse/HUDI-6445 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: sivabalan narayanan >Priority: Major > > CI has been unstable for the past few weeks. we need to triage them and fix > it. > > > UT-spark datasource module times out after 3 hours. >
[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
guanziyue commented on PR #9052: URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607961507 > @guanziyue Can you take a look at this PR, the background is when bucket index is used for Spark engine, the exception happens in very high odds, is there any good idea we can strength the usability? Thanks Danny. Got more info from author side, it seems to be the same problem which should be fixed by a previous PR. May wait for further feedback.
[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
guanziyue commented on PR #9052: URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607958298 > > May I know if this still occur after [HUDI-6401](https://issues.apache.org/jira/browse/HUDI-6401) is merged? And if so, could you also share the stacktrace including HoodieWriteHandle code path? > > yes,use the master branch,and fix like the current pr can fix it,error code path like above stack As we discussed offline, could you please give [HUDI-6401](https://issues.apache.org/jira/browse/HUDI-6401) a try? Looking forward to your feedback!
[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023
[ https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6445: -- Description: CI has been unstable for the past few weeks. we need to triage them and fix it. UT-spark datasource module times out after 3 hours. [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] TestHoodieDeltaStreamer.testUpsertsMORContinuousMode and testAsyncClusteringServiceWithCompaction [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] TestWriteMergeOnRead.testUpsert [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35] TestWriteMergeOnReadWithCompact.testUpsert TestWriteCopyOnWrite.testSubtaskFails [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30] was: CI has been unstable for the past few weeks. we need to triage them and fix it. UT-spark datasource module times out after 3 hours. 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] TestHoodieDeltaStreamer.testUpsertsMORContinuousMode and testAsyncClusteringServiceWithCompaction [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] > Fix CI stability Jun 26, 2023 > - > > Key: HUDI-6445 > URL: https://issues.apache.org/jira/browse/HUDI-6445 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: sivabalan narayanan >Priority: Major > > CI has been unstable for the past few weeks. we need to triage them and fix > it. > > > UT-spark datasource module times out after 3 hours. > [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] > > TestHoodieDeltaStreamer.testUpsertsMORContinuousMode > and testAsyncClusteringServiceWithCompaction > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] > > TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] > > TestWriteMergeOnRead.testUpsert > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/17993/logs/35] > > TestWriteMergeOnReadWithCompact.testUpsert > TestWriteCopyOnWrite.testSubtaskFails > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18110/logs/30] > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023
[ https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6445: -- Description: CI has been unstable for the past few weeks. we need to triage them and fix it. UT-spark datasource module times out after 3 hours. [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] TestHoodieDeltaStreamer.testUpsertsMORContinuousMode and testAsyncClusteringServiceWithCompaction [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] was:CI has been unstable for the past few weeks. we need to triage them and fix it > Fix CI stability Jun 26, 2023 > - > > Key: HUDI-6445 > URL: https://issues.apache.org/jira/browse/HUDI-6445 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: sivabalan narayanan >Priority: Major > > CI has been unstable for the past few weeks. we need to triage them and fix > it. > > > UT-spark datasource module times out after 3 hours. > [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17956=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc] > > TestHoodieDeltaStreamer.testUpsertsMORContinuousMode > and testAsyncClusteringServiceWithCompaction > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18111/logs/19] > > TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersForConflicts > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/18080/logs/21] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6445) Fix CI stability Jun 26, 2023
[ https://issues.apache.org/jira/browse/HUDI-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6445: -- Epic Link: HUDI-4302 > Fix CI stability Jun 26, 2023 > - > > Key: HUDI-6445 > URL: https://issues.apache.org/jira/browse/HUDI-6445 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: sivabalan narayanan >Priority: Major > > CI has been unstable for the past few weeks. we need to triage them and fix it
[jira] [Created] (HUDI-6445) Fix CI stability Jun 26, 2023
sivabalan narayanan created HUDI-6445: - Summary: Fix CI stability Jun 26, 2023 Key: HUDI-6445 URL: https://issues.apache.org/jira/browse/HUDI-6445 Project: Apache Hudi Issue Type: Test Components: tests-ci Reporter: sivabalan narayanan CI has been unstable for the past few weeks. we need to triage them and fix it
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1242484847 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig { .sinceVersion("0.10.0") .withDocumentation("File Id Prefix provider class, that implements `org.apache.hudi.fileid.FileIdPrefixProvider`"); + public static final ConfigProperty ENFORCE_COMPLETION_MARKER_CHECKS = ConfigProperty + .key("hoodie.markers.enforce.completion.checks") + .defaultValue("false") + .sinceVersion("0.10.0") + .withDocumentation("Prevents the creation of duplicate data files, when multiple spark tasks are racing to " + + "create data files and a completed data file is already present"); + + public static final ConfigProperty ENFORCE_FINALIZE_WRITE_CHECK = ConfigProperty + .key("hoodie.markers.enforce.finalize.write.check") + .defaultValue("false") + .sinceVersion("0.10.0") Review Comment: will do.
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1242484399 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -512,6 +512,7 @@ public List close() { status.getStat().setFileSizeInBytes(logFileSize); } + createCompletedMarkerFile(partitionPath, baseInstantTime); Review Comment: Will update the diff after adding this check. (We have this enabled by default; it makes sense to wrap it with the flag for OSS.)
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1242482645 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/TimelineServerBasedWriteMarkers.java: ## @@ -132,6 +153,25 @@ public Set allMarkerFilePaths() { } } + @Override + public void createMarkerDir() throws HoodieIOException { +HoodieTimer timer = new HoodieTimer().startTimer(); +Map paramsMap = new HashMap<>(); +paramsMap.put(MARKER_DIR_PATH_PARAM, markerDirPath.toString()); Review Comment: Currently, the timeline-server-based markers are designed using this mechanism, mainly for cloud-based deployments. @nsivabalan @yihua can add details.
[GitHub] [hudi] nbalajee commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-1607857303 > Thanks for the contribution @nbalajee , In general I'm confused why we need two marker files for each base file, before the patch, we have in-progress marker file and write status real paths, we can diff out the corrupt/retry files by comparing the in-progress marker file handles and the paths recorded in writestatus. > > And we also have some instant completion check in HoodieFileSystemView, to ignore the files/file blocks that are still pending, so why the reader view could read data sets that are not intented to be exposed? Thanks for the review, @dannyhchen and @nsivabalan. The following diagram summarizes the issue: (a) a batch of records given to an executor for writing spills over to multiple data files (split into multiple parts due to file size limits: f1-0_w1_c1.parquet, f1-1_w1_c1.parquet, etc.); (b) a spark stage is retried, so all of its tasks are retried while some tasks from previous attempts could still be on-going. This mainly happens with a spark FetchFailed exception. ![Screenshot 2023-06-25 at 9 15 35 PM](https://github.com/apache/hudi/assets/47542891/7121d7e6-e624-4743-ad00-004fde3e8344)
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1242478376 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) { * * @param partitionPath Partition path */ - protected void createMarkerFile(String partitionPath, String dataFileName) { -WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime) -.create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline()); + protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) { +WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime); +if (!writeMarkers.doesMarkerDirExist()) { Review Comment: If we allow the markerDir to be created on a need basis, a stray executor starting to write to a file would create the directory after the finalize write and end up leaving a duplicate file. By creating the markerDir at the time of startCommit() and deleting the directory at/after the finalizeWrite(), we ensure that executors can't start a new write operation or successfully close an on-going write operation, if markerDir is missing (deleted by finalizeWrite). 
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) { * * @param partitionPath Partition path */ - protected void createMarkerFile(String partitionPath, String dataFileName) { -WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime) -.create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline()); + protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) { +WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime); +if (!writeMarkers.doesMarkerDirExist()) { + throw new HoodieIOException(String.format("Marker root directory absent : %s/%s (%s)", + partitionPath, dataFileName, markerInstantTime)); +} +if (config.enforceFinalizeWriteCheck() +&& writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", "FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) { + throw new HoodieCorruptedDataException("Reconciliation for instant " + instantTime + " is completed, job is trying to re-write the data files."); +} +if (config.enforceCompletionMarkerCheck() +&& writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath(partitionPath, fileId, markerInstantTime, getIOType( { + throw new HoodieIOException("Completed marker file exists for : " + dataFileName + " (" + instantTime + ")"); +} +writeMarkers.create(partitionPath, dataFileName, getIOType()); + } + + // visible for testing + public void createCompletedMarkerFile(String partition, String markerInstantTime) throws IOException { +try { + WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime) + .createCompletionMarker(partition, fileId, markerInstantTime, getIOType(), true); +} catch (Exception e) { + // Clean up the data file, if the marker is already present or marker directories don't exist. 
+ Path partitionPath = FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(), partition); Review Comment: After the finalizeWrite and reconciling the files, we delete the marker directory. If a stray executor were to complete the write operation and close the file after the reconcile step, it would find the marker directory missing and would clean up the data file it created. ![Screenshot 2023-06-25 at 9 15 35 PM](https://github.com/apache/hudi/assets/47542891/f84e70f9-5f17-4454-8ff1-608c59056ef3) In the example, executor C tries to close the file after the finalizeWrite operation.
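The marker-directory gate described in this thread can be summarized with a small state-machine sketch. This is a hypothetical simplification, not the actual Hudi code (the real logic lives in HoodieWriteHandle and the WriteMarkers classes): a write may only begin or complete while the marker directory still exists, so a stray executor finishing after finalizeWrite() has deleted the directory must discard its data file instead of leaking a duplicate.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the completion-marker protocol discussed above.
public class MarkerGateSketch {
    private boolean markerDirExists = false;
    private final Set<String> completed = new HashSet<>();

    void startCommit() {
        markerDirExists = true;               // createMarkerDir() at startCommit()
    }

    void beginWrite(String file) {
        if (!markerDirExists) {               // write after finalizeWrite(): reject
            throw new IllegalStateException("Marker root directory absent: " + file);
        }
        if (completed.contains(file)) {       // duplicate completed file: reject
            throw new IllegalStateException("Completed marker exists for: " + file);
        }
        // ... create in-progress marker ...
    }

    /** Returns true if the close was accepted; false means the data file must be cleaned up. */
    boolean closeWrite(String file) {
        if (!markerDirExists) {
            return false;                     // stray executor closing after reconcile
        }
        completed.add(file);                  // create completion marker
        return true;
    }

    void finalizeWrite() {
        markerDirExists = false;              // reconcile files, then delete marker dir
    }

    public static void main(String[] args) {
        MarkerGateSketch gate = new MarkerGateSketch();
        gate.startCommit();
        gate.beginWrite("p1/f1_w1_c1.parquet");
        System.out.println(gate.closeWrite("p1/f1_w1_c1.parquet")); // true
        gate.finalizeWrite();
        System.out.println(gate.closeWrite("p1/f1_w1_c1.parquet")); // false
    }
}
```

In this model, executor C from the diagram calls closeWrite() after finalizeWrite() and gets false back, so its duplicate file is deleted rather than becoming visible to readers.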
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1242478130 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java: ## @@ -901,6 +901,9 @@ private void startCommit(String instantTime, String actionType, HoodieTableMetaC metaClient.getActiveTimeline().createNewInstant(new HoodieInstant(HoodieInstant.State.REQUESTED, actionType, instantTime)); } + +// populate marker directory for the commit. +WriteMarkersFactory.get(config.getMarkersType(), createTable(config, hadoopConf), instantTime).createMarkerDir(); Review Comment: That is the current behavior, DoesMarkerDirExists() check ensures that an executor can't start/complete the write operation, after finalizeWrite(). ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -612,6 +612,20 @@ public class HoodieWriteConfig extends HoodieConfig { .sinceVersion("0.10.0") .withDocumentation("File Id Prefix provider class, that implements `org.apache.hudi.fileid.FileIdPrefixProvider`"); + public static final ConfigProperty ENFORCE_COMPLETION_MARKER_CHECKS = ConfigProperty + .key("hoodie.markers.enforce.completion.checks") + .defaultValue("false") + .sinceVersion("0.10.0") + .withDocumentation("Prevents the creation of duplicate data files, when multiple spark tasks are racing to " + + "create data files and a completed data file is already present"); + + public static final ConfigProperty ENFORCE_FINALIZE_WRITE_CHECK = ConfigProperty + .key("hoodie.markers.enforce.finalize.write.check") + .defaultValue("false") + .sinceVersion("0.10.0") + .withDocumentation("When WriteStatus obj is lost due to engine related failures, then recomputing would involve " + + "re-writing all the data files. When this check is enabled it would block the rewrite from happening."); Review Comment: I will update the doc. 
Context: This check was added to address the following scenario: (1) as part of the insert/upsert operation, a set of files has been created (p1/f1_w1_c1.parquet, p2/f2_w2_c1.parquet, corresponding to commit c1). (2) FinalizeWrite() successfully purged files that were created but not part of the writeStatus. (3) As part of completing commit c1, we update the MDT with file-listing and RLI metadata. In order to update the record index, when iterating over the writeStatuses, if writeStatus RDD blocks are found to be missing, the execution engine (spark) would re-trigger the write stage to recreate the write statuses. The above flag is used to avoid rewriting all the files as part of the stage retry (which is more likely to fail on the second attempt). Instead, we fail the job so that the next write attempt can be made in a new job, after any required resource tuning. This is not an issue for small/medium sized tables; we have seen it only on large tables (50B+ records).
[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
hudi-bot commented on PR #9054: URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607783798 ## CI report: * 3819ebe617f8338430fc1d1058f7e3938a6770e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18114) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] stp-pv commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT
stp-pv commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607785019 We are seeing the problem with insert as well. Here is the simplest fix for the problem we are observing:
```diff
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
index b42e6f8800..a0531772db 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala
@@ -561,7 +561,7 @@ class HoodieCDCRDD(
       originTableSchema.structTypeSchema.zipWithIndex.foreach {
         case (field, idx) =>
           if (field.dataType.isInstanceOf[StringType]) {
-            map(field.name) = record.getString(idx)
+            map(field.name) = Option(record.getUTF8String(idx)).map(_.toString).orNull
           } else {
             map(field.name) = record.get(idx, field.dataType)
           }
```
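The NPE arises because the unsafe accessor dereferences a null string value, while the fix wraps the raw value in an Option before converting it. A minimal Java illustration of the same pattern (hypothetical helper names; CharSequence stands in for Spark's UTF8String):

```java
import java.util.Optional;

public class NullSafeRead {
    /** Mimics record.getString(idx): throws NullPointerException when the value is null. */
    static String unsafeGet(CharSequence raw) {
        return raw.toString();
    }

    /** Mirrors the fix: Option(record.getUTF8String(idx)).map(_.toString).orNull */
    static String safeGet(CharSequence raw) {
        return Optional.ofNullable(raw).map(CharSequence::toString).orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(safeGet(null));      // null, instead of an NPE
        System.out.println(safeGet("value"));   // value
    }
}
```

The key design point is that nullability is handled at the read site, so a null column value propagates as a null map entry rather than aborting the whole CDC scan.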
[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT
ad1happy2go commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607770606 @zaza Thanks for the information. I am able to reproduce it with null values in one of the columns. Also confirmed this only happens with bulk_insert. I will check with the master code once and then create a JIRA to fix it if it's still an issue.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9037: [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis
nsivabalan commented on code in PR #9037: URL: https://github.com/apache/hudi/pull/9037#discussion_r1242392462 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java: ## @@ -206,7 +198,12 @@ protected ClosableIterator getIndexedRecordIterator(Schema reader } // TODO eval whether seeking scanner would be faster than pread -HFileScanner scanner = getHFileScanner(reader, false); +HFileScanner scanner = null; +try { + scanner = getHFileScanner(reader, false, false); +} catch (IOException e) { + throw new HoodieIOException("Instantiation HfileScanner failed for " + reader.getHFileInfo().toString()); +} Review Comment: Every other method in the interface throws IOException except this one, so I left it as is.
[GitHub] [hudi] parisni opened a new pull request, #9056: [DOC] Add parquet blooms documentation
parisni opened a new pull request, #9056: URL: https://github.com/apache/hudi/pull/9056 ### Change Logs This adds doc for the parquet bloom feature. I added it in 0.13.1, but this likely should be moved to 0.14 ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Closed] (HUDI-5447) Add support for Record level index read from MDT
[ https://issues.apache.org/jira/browse/HUDI-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-5447. - Resolution: Fixed > Add support for Record level index read from MDT > > > Key: HUDI-5447 > URL: https://issues.apache.org/jira/browse/HUDI-5447 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > > introduce a new index which will leverage record level index partition in MDT > and assist in tag locations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5446) Add support to write record level index to MDT
[ https://issues.apache.org/jira/browse/HUDI-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-5446. - Resolution: Fixed > Add support to write record level index to MDT > -- > > Key: HUDI-5446 > URL: https://issues.apache.org/jira/browse/HUDI-5446 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > > Add support to write our record level index partition to MDT
[jira] [Closed] (HUDI-5444) FileNotFound issue w/ metadata enabled
[ https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-5444. - Resolution: Invalid
> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
> Issue Type: Bug
> Components: metadata
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Fix For: 0.14.0
>
> stacktrace
> {code:java}
> Caused by: java.io.FileNotFoundException: File not found: gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
>   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
> {code}
>
> 20221208133227028 (RB_C10)
> 20221208133227028001 MDT compaction
> 20221208132316380 (C10)
> 20221208133647230
>
> DT timeline (rollbacks):
> 8: 20221202234515099 rollback COMPLETED, rolls back 2022120413756, 12-02 15:45:18 / 12-02 15:45:18 / 12-02 15:45:33
> 9: 20221208133227028 rollback COMPLETED, rolls back 20221208132316380, 12-08 05:32:33 / 12-08 05:32:33 / 12-08 05:32:44
> 10: 20221208133647230 rollback COMPLETED, rolls back 20221208133222583, 12-08 05:36:47 / 12-08 05:36:48 / 12-08 05:36:57
>
> MDT timeline:
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:32 20221208133227028.deltacommit.requested
> -rw-r--r--@ 1 nsb staff  548 Dec 8 05:32 20221208133227028.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 6042 Dec 8 05:32 20221208133227028.deltacommit
> -rw-r--r--@ 1 nsb staff 1938 Dec 8 05:34 20221208133227028001.compaction.requested
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:34 20221208133227028001.compaction.inflight
> -rw-r--r--@ 1 nsb staff 7556 Dec 8 05:34 20221208133227028001.commit
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:34 20221208132316380.deltacommit.requested
> -rw-r--r--@ 1 nsb staff 3049 Dec 8 05:34 20221208132316380.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 8207 Dec 8 05:35 20221208132316380.deltacommit
> -rw-r--r--@ 1 nsb staff    0 Dec 8 05:36 20221208133647230.deltacommit.requested
> -rw-r--r--@ 1 nsb staff  548 Dec 8 05:36 20221208133647230.deltacommit.inflight
> -rw-r--r--@ 1 nsb staff 6042 Dec 8 05:36 20221208133647230.deltacommit
[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI
[ https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6443: - Labels: pull-request-available (was: ) > Support insert_overwrite and insert_overwrite_table with RLI > > > Key: HUDI-6443 > URL: https://issues.apache.org/jira/browse/HUDI-6443 > Project: Apache Hudi > Issue Type: Improvement > Components: index, metadata >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.14.0 > >
[GitHub] [hudi] xushiyan opened a new pull request, #9055: [HUDI-6443] Support insert_overwrite/table with record-level index
xushiyan opened a new pull request, #9055: URL: https://github.com/apache/hudi/pull/9055 ### Change Logs Support `insert_overwrite` and `insert_overwrite_table` with record-level index. The metadata records should be updated accordingly. - newly inserted records should be present in RLI - old records in the affected partitions should be removed from RLI - old records that happen to have the same record key as the new inserts won't be removed from RLI; their entries will be updated ### Impact RLI data integrity ### Risk level Medium - [ ] UT, FT and e2e testing. ### Documentation Update NA ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
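The three bullets in the change log amount to a per-partition replacement of index entries. A minimal sketch of that bookkeeping (plain Python; the dict-based index and function name are illustrative, not Hudi code):

```python
def apply_insert_overwrite(rli, partition, new_records):
    """rli maps record_key -> (partition, file_id). insert_overwrite on one
    partition drops every old key in that partition, then (re)points keys
    for the new inserts; keys shared with old records end up updated."""
    # old records in the affected partition are removed from the index
    rli = {k: v for k, v in rli.items() if v[0] != partition}
    # newly inserted records are added; colliding keys are updated, not dropped
    for key, file_id in new_records:
        rli[key] = (partition, file_id)
    return rli

# "k2" collides with an old record in p1, so its entry is updated;
# "k1" only existed in p1, so it disappears; "k3" in p2 is untouched.
index = {"k1": ("p1", "f0"), "k2": ("p1", "f0"), "k3": ("p2", "f9")}
index = apply_insert_overwrite(index, "p1", [("k2", "f1"), ("k4", "f1")])
```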
[GitHub] [hudi] zaza commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT
zaza commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607596535 Hi @ad1happy2go, thanks for giving it a go. I followed your setup and it did work for me as well. After taking a deeper dive into our tables and what's in them, we realized some of our records have _null values_ (with the field marked as nullable in the schema). It doesn't seem like any of the records from DataGenerator have empty fields, but would you mind trying your example with that in mind? Once it's confirmed that null values are the culprit here, I will update the summary.
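The null-value failure mode described here can be mimicked outside Spark with a minimal sketch (plain Python; field names are hypothetical): a record-to-map conversion that dereferences each field value blows up on a null field, while a guarded version propagates the null.

```python
def to_map_unsafe(schema, record):
    # Mimics map(field.name) = <method call on field value>: calling a
    # method on a null field raises, analogous to the reported NPE.
    return {name: record[idx].upper() for idx, name in enumerate(schema)}

def to_map_safe(schema, record):
    # Guard each field: propagate None instead of dereferencing it.
    return {name: (record[idx].upper() if record[idx] is not None else None)
            for idx, name in enumerate(schema)}

schema = ["uuid", "rider"]
record = ["id-1", None]  # nullable field left empty, as in the report
```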
[GitHub] [hudi] hudi-bot commented on pull request #5071: [HUDI-1881]: draft implementation for trigger based on data availability
hudi-bot commented on PR #5071: URL: https://github.com/apache/hudi/pull/5071#issuecomment-1607567423 ## CI report: * b7203e6d2d6f1e8d3121024faedfa2da1ccc0c71 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7088) * 518758403252fd03ca77eb8977dda217575efecc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI
[ https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-6443: - Priority: Blocker (was: Major) > Support insert_overwrite and insert_overwrite_table with RLI > > > Key: HUDI-6443 > URL: https://issues.apache.org/jira/browse/HUDI-6443 > Project: Apache Hudi > Issue Type: Improvement > Components: index, metadata >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.14.0 > >
[jira] [Assigned] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI
[ https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-6443: Assignee: Raymond Xu > Support insert_overwrite and insert_overwrite_table with RLI > > > Key: HUDI-6443 > URL: https://issues.apache.org/jira/browse/HUDI-6443 > Project: Apache Hudi > Issue Type: Improvement > Components: index, metadata >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.14.0 > >
[jira] [Updated] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI
[ https://issues.apache.org/jira/browse/HUDI-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-6443: - Fix Version/s: 0.14.0 > Support insert_overwrite and insert_overwrite_table with RLI > > > Key: HUDI-6443 > URL: https://issues.apache.org/jira/browse/HUDI-6443 > Project: Apache Hudi > Issue Type: Improvement > Components: index, metadata >Reporter: Raymond Xu >Priority: Major > Fix For: 0.14.0 > >
[jira] [Created] (HUDI-6444) Support delete and delete_partition with RLI
Raymond Xu created HUDI-6444: Summary: Support delete and delete_partition with RLI Key: HUDI-6444 URL: https://issues.apache.org/jira/browse/HUDI-6444 Project: Apache Hudi Issue Type: Improvement Components: index, metadata Reporter: Raymond Xu Fix For: 0.14.0
[jira] [Created] (HUDI-6443) Support insert_overwrite and insert_overwrite_table with RLI
Raymond Xu created HUDI-6443: Summary: Support insert_overwrite and insert_overwrite_table with RLI Key: HUDI-6443 URL: https://issues.apache.org/jira/browse/HUDI-6443 Project: Apache Hudi Issue Type: Improvement Components: index, metadata Reporter: Raymond Xu
[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
hudi-bot commented on PR #9049: URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607556925 ## CI report: * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
[jira] [Assigned] (HUDI-6369) Spacial curve with sample strategy fails when 0 or 1 rows only is incoming
[ https://issues.apache.org/jira/browse/HUDI-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6369: --- Assignee: nicolas paris > Spacial curve with sample strategy fails when 0 or 1 rows only is incoming > -- > > Key: HUDI-6369 > URL: https://issues.apache.org/jira/browse/HUDI-6369 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Github Issue - [https://github.com/apache/hudi/issues/8934]
[GitHub] [hudi] hudi-bot commented on pull request #9053: [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming
hudi-bot commented on PR #9053: URL: https://github.com/apache/hudi/pull/9053#issuecomment-1607440001 ## CI report: * bf5569721d0a4d7019d1897c3af941031c3a3d30 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18112)
[jira] [Updated] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka updated HUDI-6438: Priority: Critical (was: Major) > Fix issue while inserting non-nullable array columns to nullable columns > > > Key: HUDI-6438 > URL: https://issues.apache.org/jira/browse/HUDI-6438 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Priority: Critical > Fix For: 0.14.0 > > > Github issue - [https://github.com/apache/hudi/issues/9042]
[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
hudi-bot commented on PR #9054: URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607351428 ## CI report: * 3819ebe617f8338430fc1d1058f7e3938a6770e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18114)
[GitHub] [hudi] hudi-bot commented on pull request #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
hudi-bot commented on PR #9054: URL: https://github.com/apache/hudi/pull/9054#issuecomment-1607338553 ## CI report: * 3819ebe617f8338430fc1d1058f7e3938a6770e8 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7343: [HUDI-5303] Allow users to control the concurrency to submit jobs in clustering
hudi-bot commented on PR #7343: URL: https://github.com/apache/hudi/pull/7343#issuecomment-1607315899 ## CI report: * 372cdaea808b0e17ef4868323a673dc3a15be1aa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18107)
[jira] [Updated] (HUDI-6442) TestPartialUpdateForMergeInto should support mor
[ https://issues.apache.org/jira/browse/HUDI-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6442: - Labels: pull-request-available (was: ) > TestPartialUpdateForMergeInto should support mor > > > Key: HUDI-6442 > URL: https://issues.apache.org/jira/browse/HUDI-6442 > Project: Apache Hudi > Issue Type: Bug > Components: tests-ci >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > >
[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT
ad1happy2go commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607307441 @zaza Also, can you share your full table configuration? That might help me to reproduce this error.
[GitHub] [hudi] xuzifu666 opened a new pull request, #9054: [HUDI-6442] TestPartialUpdateForMergeInto should support mor
xuzifu666 opened a new pull request, #9054: URL: https://github.com/apache/hudi/pull/9054 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ none ### Impact _Describe any public API or user-facing feature change or any performance impact._ none ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ TestPartialUpdateForMergeInto should support mor ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-6442) TestPartialUpdateForMergeInto should support mor
xy created HUDI-6442: Summary: TestPartialUpdateForMergeInto should support mor Key: HUDI-6442 URL: https://issues.apache.org/jira/browse/HUDI-6442 Project: Apache Hudi Issue Type: Bug Components: tests-ci Reporter: xy Assignee: xy Fix For: 0.14.0
[GitHub] [hudi] ad1happy2go commented on issue #9032: [SUPPORT] NPE when reading from a CDC-enabled table but written with BULK_INSERT
ad1happy2go commented on issue #9032: URL: https://github.com/apache/hudi/issues/9032#issuecomment-1607304355 @zaza I tried to reproduce the issue with Hudi 0.13.1, but I am seeing the appropriate behaviour for bulk insert too. All CDC rows are coming as inserts, which is expected for bulk insert. Can you let me know in exactly what scenario you are getting the NullPointerException? Is it intermittent? Code I tried -
```
val path = "file:///tmp/output/issue_9032_4"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
val options = Map(
  "hoodie.table.name" -> "line_items",
  "hoodie.datasource.write.recordkey.field" -> "uuid",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.datasource.write.partitionpath.field" -> "partitionpath",
  "hoodie.parquet.max.file.size" -> "125829120",
  "hoodie.parquet.small.file.limit" -> "104857600",
  "hoodie.index.type" -> "BLOOM",
  "hoodie.bloom.index.use.metadata" -> "true",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained" -> "168",
  "hoodie.keep.min.commits" -> "173",
  "hoodie.keep.max.commits" -> "174"
)
df.write.format("hudi").options(options)
  .option(DataSourceWriteOptions.OPERATION.key, "bulk_insert")
  .option(HoodieTableConfig.NAME.key(), "line_items")
  .option(HoodieTableConfig.CDC_ENABLED.key, "true")
  .option(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE.key, HoodieCDCSupplementalLoggingMode.data_before_after.name())
  .mode("append")
  .save(path)
spark.readStream.format("hudi")
  .option("hoodie.datasource.query.incremental.format", "cdc")
  .option("hoodie.datasource.query.type", "incremental")
  .load(path)
  .writeStream.foreachBatch { (batch: Dataset[Row], _: Long) => batch.show(false) }
  .start.awaitTermination
```
[GitHub] [hudi] joe-shad commented on pull request #5071: [HUDI-1881]: draft implementation for trigger based on data availability
joe-shad commented on PR #5071: URL: https://github.com/apache/hudi/pull/5071#issuecomment-1607260369 I'm waiting for this PR (or any possible solution to the continuous mode for MultiTableDeltaStreamer) as well
[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
hudi-bot commented on PR #9049: URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607256682 ## CI report: * 8960860b33c4b0a0016d8ee718525cb58f0a6959 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18099) * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18113)
[GitHub] [hudi] hudi-bot commented on pull request #9049: [HUDI-6435] Add some logs to the updateHeartbeat method
hudi-bot commented on PR #9049: URL: https://github.com/apache/hudi/pull/9049#issuecomment-1607235976 ## CI report: * 8960860b33c4b0a0016d8ee718525cb58f0a6959 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18099) * 7b7a50f0d0f4abd5fb36c06bd078ccef1b1cb76d UNKNOWN
[GitHub] [hudi] ad1happy2go commented on issue #8919: [SUPPORT] Hudi Stored Procedure show clustering fails on AWS Glue 4.0
ad1happy2go commented on issue #8919: URL: https://github.com/apache/hudi/issues/8919#issuecomment-1607224457 @soumilshah1995 I am able to successfully run clustering with your code. The third block for show clustering fails as expected, as it tries to find the table name and we are passing the path. Can you clarify when you are seeing the error `java.util.NoSuchElementException: No value present in Option`? I didn't hit this error with Glue 4.0.
[GitHub] [hudi] xushiyan closed issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP URI with Callback | Passing Custom Headers ?
xushiyan closed issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP URI with Callback | Passing Custom Headers ? URL: https://github.com/apache/hudi/issues/8834
[GitHub] [hudi] ad1happy2go commented on issue #8834: [SUPPORT] Push Hudi Commit Notification TO HTTP URI with Callback | Passing Custom Headers ?
ad1happy2go commented on issue #8834: URL: https://github.com/apache/hudi/issues/8834#issuecomment-1607183534 @soumilshah1995 Thanks for raising this. Hudi doesn't have a way to pass custom headers at the moment. Created a JIRA to track this improvement - https://issues.apache.org/jira/browse/HUDI-6441
[jira] [Created] (HUDI-6441) Passing custom Headers with Hudi Callback URL
Aditya Goenka created HUDI-6441: --- Summary: Passing custom Headers with Hudi Callback URL Key: HUDI-6441 URL: https://issues.apache.org/jira/browse/HUDI-6441 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: Aditya Goenka Fix For: 1.0.0 Hudi callback URLs don't support passing custom headers as of now. Implement a way to pass them and use them for the callback. Github Issue - [https://github.com/apache/hudi/issues/8834]
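What the improvement asks for can be sketched as merging user-supplied headers over the callback request's defaults (plain Python; `build_callback_request` and the header names are hypothetical, not an existing Hudi config or API):

```python
def build_callback_request(url, payload, custom_headers=None):
    """Merge user-supplied headers over the callback defaults; custom
    entries win on key collision so users can override defaults too."""
    headers = {"Content-Type": "application/json"}
    headers.update(custom_headers or {})
    return {"url": url, "headers": headers, "body": payload}

# A commit notification carrying an auth header the endpoint requires.
req = build_callback_request(
    "https://example.com/hudi-hook",          # hypothetical endpoint
    '{"commitTime": "20221208132316380"}',
    {"Authorization": "Bearer t0k"},
)
```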
[GitHub] [hudi] guanziyue commented on pull request #9052: [HUDI-6439] DirectWriteMarkers create file need judge appendfile whether exist
guanziyue commented on PR #9052: URL: https://github.com/apache/hudi/pull/9052#issuecomment-1607173922 May I know if this still occurs after HUDI-6401 is merged? And if so, could you also share the stacktrace including HoodieWriteHandle?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
danny0405 commented on code in PR #9038: URL: https://github.com/apache/hudi/pull/9038#discussion_r1241963611 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java: ## @@ -267,6 +267,13 @@ public HoodieTimeline getCommitsTimeline() { return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION)); } + /** + * Get all instants (commits, delta commits, replace, compaction) that produce new data or merge file, in the active timeline. + */ + public HoodieTimeline getCommitsAndMergesTimeline() { +return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION)); + } Review Comment: getCommitsAndMergesTimeline -> getCommitsAndCompactionTimeline Can we also add a test case for this incremental cleaning scenario, where partition path got switched and the old partition files could not be cleaned.
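The helper under review just filters the timeline down to a set of actions. Its shape can be sketched like this (plain Python; instants modeled as (timestamp, action) pairs, illustrative only, not the Hudi API):

```python
# Action names mirroring the constants used in the diff above.
COMMIT, DELTA_COMMIT, REPLACE_COMMIT, COMPACTION = (
    "commit", "deltacommit", "replacecommit", "compaction")

def timeline_of_actions(instants, actions):
    # Keep only instants whose action is in the requested set,
    # preserving timeline order.
    return [i for i in instants if i[1] in actions]

def commits_and_compaction_timeline(instants):
    # Includes compaction, so an inflight compaction instant bounds
    # the incremental-cleaning window as the review suggests.
    return timeline_of_actions(
        instants, {COMMIT, DELTA_COMMIT, REPLACE_COMMIT, COMPACTION})

tl = [("001", "commit"), ("002", "clean"),
      ("003", "compaction"), ("004", "deltacommit")]
```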
[GitHub] [hudi] danny0405 commented on a diff in pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
danny0405 commented on code in PR #9038: URL: https://github.com/apache/hudi/pull/9038#discussion_r1241963611 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java: ## @@ -267,6 +267,13 @@ public HoodieTimeline getCommitsTimeline() { return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION)); } + /** + * Get all instants (commits, delta commits, replace, compaction) that produce new data or merge file, in the active timeline. + */ + public HoodieTimeline getCommitsAndMergesTimeline() { +return getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION)); + } Review Comment: getCommitsAndMergesTimeline -> getCommitsAndCompactionTimeline
[GitHub] [hudi] ad1happy2go commented on issue #8984: Offline compaction schedule failing with Error fetching partition paths from metadata table
ad1happy2go commented on issue #8984: URL: https://github.com/apache/hudi/issues/8984#issuecomment-1607160179 @koochiswathiTR I don't think there is anything like that which unschedules the compaction.
[GitHub] [hudi] hudi-bot commented on pull request #9053: [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming
hudi-bot commented on PR #9053: URL: https://github.com/apache/hudi/pull/9053#issuecomment-1607157136 ## CI report: * bf5569721d0a4d7019d1897c3af941031c3a3d30 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18112)
[GitHub] [hudi] xushiyan closed issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea"
xushiyan closed issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea" URL: https://github.com/apache/hudi/issues/8906
[GitHub] [hudi] ad1happy2go commented on issue #8906: [SUPPORT] hudi upsert error: java.lang.NumberFormatException: For input string: "d880d4ea"
ad1happy2go commented on issue #8906: URL: https://github.com/apache/hudi/issues/8906#issuecomment-1607152244 @zyclove Looks like we have another issue tracking a similar problem - https://github.com/apache/hudi/issues/8986. Closing this one. Let us know in case of any concerns.