[GitHub] [hudi] danny0405 opened a new pull request, #9177: [HUDI-6526] Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive
danny0405 opened a new pull request, #9177: URL: https://github.com/apache/hudi/pull/9177 ### Change Logs After HUDI-6307, timestamp(3) is also synced as timestamp type in Hive. ### Impact The table created through hive catalog would sync the timestamp(3) as timestamp in Hive. ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6526) Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive
Danny Chen created HUDI-6526: Summary: Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive Key: HUDI-6526 URL: https://issues.apache.org/jira/browse/HUDI-6526 Project: Apache Hudi Issue Type: Bug Components: flink-sql Reporter: Danny Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason closed HUDI-6525. Resolution: Invalid > Syncing the view in Embedded TimelineServer can cause NullPointerException if > MDT is enabled > > > Key: HUDI-6525 > URL: https://issues.apache.org/jira/browse/HUDI-6525 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: 0.14.0 > Fix For: 0.14.0 > > > Assume Embedded TimelineServer is being used and there are many executors > sending requests. > Executor 1: Sends request which causes TimelineServer to sync its view > syncIfLocalViewBehind() -> view.sync(); > If the view is HoodieMetadataFileSystemView, its sync is written as follows: > > public void sync() { > super.sync(); // -> REFRESHES TIMELINE ON TimelineServer > tableMetadata.reset(); // CLOSES MDT readers and OPENS them again > } > > The issue is that once super.sync() completes (almost immediately), the > requests from other executors wont detect the local view to be behind and > hence would not call syncIfLocalViewBehind(). They would directly start > calling functions on the view (e.g. view.getLatestFileSlices()). But until > MDT is read (log merging may take time when there are large number of log > blocks), view.getLatestFileSlices() will lead to NPE. > > The fix is to reverse the order of reset and sync in > HoodieMetadataFileSystemView.
[jira] [Commented] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742289#comment-17742289 ] Prashant Wason commented on HUDI-6525: -- This is not an issue for the master branch due to the read/write lock implemented in https://github.com/apache/hudi/pull/8079. I detected this issue in the 0.10 build. > Syncing the view in Embedded TimelineServer can cause NullPointerException if > MDT is enabled > > > Key: HUDI-6525 > URL: https://issues.apache.org/jira/browse/HUDI-6525 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: 0.14.0 > Fix For: 0.14.0 > > > Assume Embedded TimelineServer is being used and there are many executors > sending requests. > Executor 1: Sends request which causes TimelineServer to sync its view > syncIfLocalViewBehind() -> view.sync(); > If the view is HoodieMetadataFileSystemView, its sync is written as follows: > > public void sync() { > super.sync(); // -> REFRESHES TIMELINE ON TimelineServer > tableMetadata.reset(); // CLOSES MDT readers and OPENS them again > } > > The issue is that once super.sync() completes (almost immediately), the > requests from other executors wont detect the local view to be behind and > hence would not call syncIfLocalViewBehind(). They would directly start > calling functions on the view (e.g. view.getLatestFileSlices()). But until > MDT is read (log merging may take time when there are large number of log > blocks), view.getLatestFileSlices() will lead to NPE. > > The fix is to reverse the order of reset and sync in > HoodieMetadataFileSystemView.
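The read/write-lock approach referenced for the master branch (https://github.com/apache/hudi/pull/8079) can be sketched roughly as follows; the class and field names are illustrative assumptions, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch of the guard pattern: view readers take the read lock,
// while sync() holds the write lock across *both* steps, so no request can
// observe a refreshed timeline paired with closed MDT readers.
class GuardedView {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> fileSlices = new ArrayList<>(); // stand-in for view state

    void sync() {
        lock.writeLock().lock();
        try {
            // stand-ins for super.sync() and tableMetadata.reset(),
            // both completed before any reader can proceed
            fileSlices = new ArrayList<>();
            fileSlices.add("file-slice-1");
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> getLatestFileSlices() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(fileSlices);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

A reader calling `getLatestFileSlices()` during `sync()` simply blocks until the write lock is released, which is why the ordering bug cannot surface on master.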
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631919853 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500) * 16ae34ec0e91811bae11a980749f5b77d048adba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18519) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Created] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
Prashant Wason created HUDI-6525: Summary: Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled Key: HUDI-6525 URL: https://issues.apache.org/jira/browse/HUDI-6525 Project: Apache Hudi Issue Type: Bug Reporter: Prashant Wason Assignee: Prashant Wason Fix For: 0.14.0 Assume Embedded TimelineServer is being used and there are many executors sending requests. Executor 1: Sends a request which causes the TimelineServer to sync its view: syncIfLocalViewBehind() -> view.sync(); If the view is HoodieMetadataFileSystemView, its sync is written as follows: public void sync() { super.sync(); // -> REFRESHES TIMELINE ON TimelineServer tableMetadata.reset(); // CLOSES MDT readers and OPENS them again } The issue is that once super.sync() completes (almost immediately), the requests from other executors won't detect the local view to be behind and hence would not call syncIfLocalViewBehind(). They would directly start calling functions on the view (e.g. view.getLatestFileSlices()). But until the MDT is read (log merging may take time when there are a large number of log blocks), view.getLatestFileSlices() will lead to an NPE. The fix is to reverse the order of reset and sync in HoodieMetadataFileSystemView.
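The ordering fix described in the issue reduces to a small invariant: the MDT reader reset must complete before the refreshed timeline becomes visible to other executors. A minimal, self-contained sketch of that invariant (illustrative names, not the actual Hudi classes):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of the ordering fix: readersOpen stands in for
// tableMetadata.reset() having finished, timelineFresh for super.sync()
// having published a refreshed timeline.
class ViewSync {
    final AtomicBoolean timelineFresh = new AtomicBoolean(false);
    final AtomicBoolean readersOpen = new AtomicBoolean(false);

    // Fixed order: reopen the MDT readers first, then mark the timeline fresh.
    void sync() {
        readersOpen.set(true);
        timelineFresh.set(true);
    }

    // Invariant a concurrent executor relies on: once the timeline looks
    // fresh (so it skips syncIfLocalViewBehind), the readers must be open.
    boolean invariantHolds() {
        return !timelineFresh.get() || readersOpen.get();
    }
}
```

With the original order (timeline first, readers second) there is a window where `timelineFresh` is true while `readersOpen` is false, which is exactly the window in which `view.getLatestFileSlices()` hits the NPE.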
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631910270 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500) * 16ae34ec0e91811bae11a980749f5b77d048adba UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1631909019 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 8429f89c5499d33074a6ed0039e5bd9a2da84024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18483) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18518) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] boneanxs commented on pull request #8452: [HUDI-6077] Add more partition push down filters
boneanxs commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1631902798 @hudi-bot run azure
[GitHub] [hudi] codope merged pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
codope merged PR #9168: URL: https://github.com/apache/hudi/pull/9168
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1631868486 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cb101756f1bb906839b8f135b618f26205e022a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18462) * 8ea8e3a853152fcc0d1cf4d0f53a38565ac33990 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18515) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631868189 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) * a0e53d8ef3f506afca5f25fd22adb70581fae426 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18516) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
SteNicholas commented on code in PR #9160: URL: https://github.com/apache/hudi/pull/9160#discussion_r1260602587

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:

@@ -185,36 +179,55 @@ public void start(String instantTime) {
   }

   /**
-   * Stops the heartbeat for the specified instant.
-   * @param instantTime
+   * Stops the heartbeat and deletes the heartbeat file for the specified instant.
+   *
+   * @param instantTime The instant time for the heartbeat.
    * @throws HoodieException
    */
   public void stop(String instantTime) throws HoodieException {
     Heartbeat heartbeat = instantToHeartbeatMap.get(instantTime);
-    if (heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped()) {
-      LOG.info("Stopping heartbeat for instant " + instantTime);
-      heartbeat.getTimer().cancel();
-      heartbeat.setHeartbeatStopped(true);
-      LOG.info("Stopped heartbeat for instant " + instantTime);
+    if (isHeartbeatStarted(heartbeat)) {
+      stopHeartbeatTimer(heartbeat);
       HeartbeatUtils.deleteHeartbeatFile(fs, basePath, instantTime);
       LOG.info("Deleted heartbeat file for instant " + instantTime);
     }
   }

   /**
-   * Stops all heartbeats started via this instance of the client.
+   * Stops all timers of heartbeats started via this instance of the client.
+   *
    * @throws HoodieException
    */
-  public void stop() throws HoodieException {
-    instantToHeartbeatMap.values().forEach(heartbeat -> stop(heartbeat.getInstantTime()));
+  public void stopHeartbeatTimers() throws HoodieException {
+    instantToHeartbeatMap.values().stream().filter(this::isHeartbeatStarted).forEach(this::stopHeartbeatTimer);
+  }
+
+  /**
+   * Whether the given heartbeat is started.
+   *
+   * @param heartbeat The heartbeat to check whether is started.
+   * @return Whether the heartbeat is started.
+   * @throws IOException
+   */
+  private boolean isHeartbeatStarted(Heartbeat heartbeat) {
+    return heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped();

Review Comment: @leesf, checking `heartbeat != null && heartbeat.isHeartbeatStarted()` or `heartbeat != null && !heartbeat.isHeartbeatStopped()` alone is not enough, because once a heartbeat is stopped both `isHeartbeatStarted` and `isHeartbeatStopped` are true.
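The flag semantics under discussion can be modeled in a few lines; this is a hedged sketch with illustrative names, not the actual Hudi `Heartbeat` class:

```java
// Minimal model of the heartbeat lifecycle flags.
class HeartbeatModel {
    private boolean started;
    private boolean stopped;

    void start() { started = true; }

    // Note: started deliberately stays true after stop(), mirroring the
    // behavior described in the review comment above.
    void stop() { stopped = true; }

    // Either flag alone misclassifies a stopped heartbeat; a heartbeat is
    // active only if it was started and has not been stopped since.
    boolean isActive() { return started && !stopped; }
}
```

This is why the `started && !stopped` conjunction (plus the null check in the real code) is required rather than either flag on its own.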
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1631862722 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cb101756f1bb906839b8f135b618f26205e022a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18462) * 8ea8e3a853152fcc0d1cf4d0f53a38565ac33990 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631862350 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) * a0e53d8ef3f506afca5f25fd22adb70581fae426 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631856861 ## CI report: * 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18506) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631856765 ## CI report: * c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18505) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream&lt;HoodieInstant&gt; getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option&lt;HoodieInstant&gt; lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream&lt;HoodieInstant&gt; getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To findout which instants are conflicting, we apply the following logic +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List&lt;HoodieInstant&gt; completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write.
+ */ + private Stream&lt;HoodieInstant&gt; getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream&lt;HoodieInstant&gt; completedCommitsStream = +activeT
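The candidate-selection rule in the strategy above (table-service writers also treat inflight ingestion commits as conflict candidates, while other writers only look at commits completed after their own instant) can be sketched as follows; the types and the timestamp-based ordering are simplifying assumptions, not the Hudi API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged, self-contained model of the selection rule; fields are
// illustrative stand-ins for HoodieInstant state and action.
class InstantModel {
    final String timestamp;
    final boolean completed;
    final boolean ingestion; // true for commit/deltacommit writers

    InstantModel(String timestamp, boolean completed, boolean ingestion) {
        this.timestamp = timestamp;
        this.completed = completed;
        this.ingestion = ingestion;
    }
}

class CandidateSelector {
    // A completed later instant always conflicts; an inflight ingestion
    // instant conflicts only when the current writer is a table service
    // (clustering/compaction), giving ingestion writers priority.
    static List<InstantModel> candidates(List<InstantModel> timeline,
                                         String currentTs,
                                         boolean currentIsTableService) {
        return timeline.stream()
            .filter(i -> (i.completed && i.timestamp.compareTo(currentTs) > 0)
                      || (currentIsTableService && !i.completed && i.ingestion))
            .collect(Collectors.toList());
    }
}
```

Under this rule a clustering or compaction plan will abort when it overlaps an in-progress ingestion write, instead of forcing the ingestion writer to retry.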
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To findout which instants are conflicting, we apply the following logic +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write. 
+ */ + private Stream getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream completedCommitsStream = +activeT
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java:
## @@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.transaction;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.common.util.Option;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * This class extends the base implementation of conflict resolution strategy.
+ * It gives preference to ingestion writers compared to table services.
+ */
+public class IngestionPrimaryWriterBasedConflictResolutionStrategy
+    extends SimpleConcurrentFileWritesConflictResolutionStrategy {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class);
+
+  /**
+   * For table services like replacecommit and compaction commits, this method also returns ingestion inflight commits.
+   */
+  @Override
+  public Stream<HoodieInstant> getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant,
+      Option<HoodieInstant> lastSuccessfulInstant) {
+    HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline();
+    if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction())
+        && ClusteringUtils.isClusteringCommit(metaClient, currentInstant))
+        || COMPACTION_ACTION.equals(currentInstant.getAction())) {
+      return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant);
+    } else {
+      return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant);
+    }
+  }
+
+  private Stream<HoodieInstant> getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) {
+    // To find out which instants are conflicting, we apply the following logic:
+    // get the completed-instants timeline, but only for commits that have happened
+    // since the last successful write, based on their state transition times.
+    // We need to check for write conflicts since they may have mutated the same files
+    // that are being newly created by the current write.
+    List<HoodieInstant> completedCommitsInstants = activeTimeline
+        .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION))
+        .filterCompletedInstants()
+        .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp())
+        .getInstantsOrderedByStateTransitionTime()
+        .collect(Collectors.toList());
+    LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants));
+    return completedCommitsInstants.stream();
+  }
+
+  /**
+   * To find which instants are conflicting, we apply the following logic:
+   * get both completed instants and ingestion inflight commits that have happened since the last successful write.
+   * We need to check for write conflicts since they may have mutated the same files
+   * that are being newly created by the current write.
+   */
+  private Stream<HoodieInstant> getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) {
+    // Fetch list of completed commits.
+    Stream<HoodieInstant> completedCommitsStream =
+        activeT
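The diff above is truncated in this digest, but the selection logic it describes can be illustrated with a small self-contained sketch. This is an assumption-laden stand-in, not Hudi's API: `Instant` here is a simplified record, and actions/states are plain strings. The key idea is that table-service writers (clustering, compaction) must also treat inflight ingestion commits as conflict candidates, so the ingestion writer wins.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified stand-in for a timeline instant; the real HoodieInstant carries more state.
record Instant(String action, String state, long transitionTime) {}

public class CandidateInstants {

  // Ingestion commits only conflict with instants that completed after the
  // current write started (compared by state transition time).
  static List<Instant> candidatesForIngestion(List<Instant> timeline, long since) {
    return timeline.stream()
        .filter(i -> i.state().equals("COMPLETED") && i.transitionTime() > since)
        .collect(Collectors.toList());
  }

  // Table-service commits additionally consider inflight ingestion commits
  // (commit/deltacommit), giving ingestion writers priority over table services.
  static List<Instant> candidatesForTableService(List<Instant> timeline, long since) {
    Stream<Instant> completed = timeline.stream()
        .filter(i -> i.state().equals("COMPLETED") && i.transitionTime() > since);
    Stream<Instant> inflightIngestion = timeline.stream()
        .filter(i -> i.state().equals("INFLIGHT")
            && (i.action().equals("commit") || i.action().equals("deltacommit")));
    return Stream.concat(completed, inflightIngestion).collect(Collectors.toList());
  }
}
```

Under this sketch, an inflight deltacommit shows up as a candidate only for a table-service writer, which is exactly the asymmetry that makes ingestion the primary writer.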
[GitHub] [hudi] hudi-bot commented on pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
hudi-bot commented on PR #9175: URL: https://github.com/apache/hudi/pull/9175#issuecomment-1631826023 ## CI report: * 98654ed70f92bcbe239b999e733b7aa79c296faf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18513) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9176: [HUDI-6523] fix get valid checkpoint for current writer
hudi-bot commented on PR #9176: URL: https://github.com/apache/hudi/pull/9176#issuecomment-1631826058 ## CI report: * 4e765c59cbf491b4a593e77b0951d9ab74b516bd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18514) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631825658 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8610: [HUDI-6156] Prevent leaving tmp file in timeline when multi process t…
hudi-bot commented on PR #8610: URL: https://github.com/apache/hudi/pull/8610#issuecomment-1631825131 ## CI report: * f34ffd6ccf4fd366ade5dad8487ff9a0a248bec8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16758) * 0bd8531ac13126771f84db19fa02fbe1828b762d UNKNOWN * bfc9450676a0742bad6e25ee1e56aa240e31fe74 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18512) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9176: [HUDI-6523] fix get valid checkpoint for current writer
hudi-bot commented on PR #9176: URL: https://github.com/apache/hudi/pull/9176#issuecomment-1631820866 ## CI report: * 4e765c59cbf491b4a593e77b0951d9ab74b516bd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
hudi-bot commented on PR #9175: URL: https://github.com/apache/hudi/pull/9175#issuecomment-1631820842 ## CI report: * 98654ed70f92bcbe239b999e733b7aa79c296faf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8610: [HUDI-6156] Prevent leaving tmp file in timeline when multi process t…
hudi-bot commented on PR #8610: URL: https://github.com/apache/hudi/pull/8610#issuecomment-1631819960 ## CI report: * f34ffd6ccf4fd366ade5dad8487ff9a0a248bec8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16758) * 0bd8531ac13126771f84db19fa02fbe1828b762d UNKNOWN * bfc9450676a0742bad6e25ee1e56aa240e31fe74 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631815699 ## CI report: * 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18502) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631815308 ## CI report: * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509) * 66acd28baa87819b949d165775556169243ccc3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on pull request #9068: [HUDI-6447] Unify operation in compaction and clustering procedure
voonhous commented on PR #9068: URL: https://github.com/apache/hudi/pull/9068#issuecomment-1631813308 @Zouxxyy Do you mean: > For example, ~compaction~ clustering procedure only supports schedule and run (do both schedule and execute), execute only is not supported,... Got a little confused when I read the description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
SteNicholas commented on code in PR #9175: URL: https://github.com/apache/hudi/pull/9175#discussion_r1260538162

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
## @@ -362,6 +362,10 @@ private void initClientIds(Configuration conf) {
   private void reset() {
     this.eventBuffer = new WriteMetadataEvent[this.parallelism];
+    if (!this.instant.equals(WriteMetadataEvent.BOOTSTRAP_INSTANT)) {
+      writeClient.getHeartbeatClient().stop(instant);

Review Comment: This should check whether the failed writes clean policy is lazy.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
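The reviewer's suggestion can be sketched as a guard around the heartbeat stop. This is a hypothetical illustration, not the Hudi code: `CleanPolicy`, `HeartbeatClient`, and `maybeStopHeartbeat` are simplified stand-ins, and the premise (heartbeats only matter under lazy failed-writes cleaning) is the reviewer's point, stated as an assumption here.

```java
// Assumed simplification of Hudi's failed-writes clean policy.
enum CleanPolicy { EAGER, LAZY }

// Minimal stand-in for the heartbeat client used by the write client.
interface HeartbeatClient {
  void stop(String instant);
}

public class ResetGuard {
  // Stop the instant's heartbeat only when the clean policy is LAZY and the
  // instant is a real one (not the placeholder bootstrap instant).
  static boolean maybeStopHeartbeat(CleanPolicy policy, String instant,
                                    String bootstrapInstant, HeartbeatClient client) {
    if (policy != CleanPolicy.LAZY || instant.equals(bootstrapInstant)) {
      return false; // nothing to stop: no heartbeat is maintained in these cases
    }
    client.stop(instant);
    return true;
  }
}
```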
[jira] [Updated] (HUDI-6523) Fix get valid checkpoint for current writer
[ https://issues.apache.org/jira/browse/HUDI-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6523: - Labels: pull-request-available (was: ) > Fix get valid checkpoint for current writer > --- > > Key: HUDI-6523 > URL: https://issues.apache.org/jira/browse/HUDI-6523 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > 23/07/11 16:50:57 INFO HoodieCompactor: Compactor compacting > [CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0001-2bae-40d8-8038-401eefb9e7e3-0_20230711165027926.log.1_1-53-670], > > dataFileName=Option{val=0001-2bae-40d8-8038-401eefb9e7e3-0_1-27-432_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0001-2bae-40d8-8038-401eefb9e7e3-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3976.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.-eb61-4788-a9cb-aaa67e2e47c4-0_20230711165027926.log.1_0-53-671], > > dataFileName=Option{val=-eb61-4788-a9cb-aaa67e2e47c4-0_0-27-431_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='-eb61-4788-a9cb-aaa67e2e47c4-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3592.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0002-7160-4515-a0a6-7bcf2e03-0_20230711165027926.log.1_2-53-673], > > dataFileName=Option{val=0002-7160-4515-a0a6-7bcf2e03-0_2-27-433_20230711165027926.parquet}, > 
id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0002-7160-4515-a0a6-7bcf2e03-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3591.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0003-5a31-411f-8430-ccf4bec128e8-0_20230711165027926.log.1_3-53-672], > > dataFileName=Option{val=0003-5a31-411f-8430-ccf4bec128e8-0_3-27-434_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0003-5a31-411f-8430-ccf4bec128e8-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3207.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}] files > 23/07/11 16:50:57 ERROR MicroBatchExecution: Query RateStreamSource [id = > 44581078-04ee-48ae-bc74-143b3c836a23, runId = > 916ce5a4-bd8a-4010-8c0e-869c58db41ab] terminated with error > org.apache.hudi.exception.HoodieIOException: Failed to parse > HoodieCommitMetadata for [==>20230711165055791__compaction__REQUESTED] > at > org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$1(CommitUtils.java:173) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:356) > at > java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:500) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at > java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) > at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) > at > 
org.apache.hudi.common.util.CommitUtils.getValidCheckpointForCurrentWriter(CommitUtils.java:175) > at > org.apache.hudi.HoodieStreamingSink.canSkipBatch(HoodieStreamingSink.scala:313) > at > org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:104) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:665) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionI
[GitHub] [hudi] eric9204 opened a new pull request, #9176: [HUDI-6523] fix get valid checkpoint for current writer
eric9204 opened a new pull request, #9176: URL: https://github.com/apache/hudi/pull/9176 ### Change Logs Filter for the commit and deltacommit actions when getting the valid checkpoint for the current writer. ### Impact NONE ### Risk level (write none, low medium or high below) NONE ### Documentation Update NONE ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
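The shape of this fix can be sketched with plain collections: a requested compaction instant carries a compaction plan rather than commit metadata, so trying to parse it for a checkpoint throws, and the lookup must skip non-commit actions first. This is a hedged illustration with invented types (`TimelineInstant`, `latestCheckpoint`) and a made-up metadata key, not Hudi's `CommitUtils` API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for a timeline instant and its parsed commit metadata.
record TimelineInstant(String action, Map<String, String> extraMetadata) {}

public class CheckpointLookup {
  // Scan instants (assumed to be in reverse chronological order) for the most
  // recent checkpoint, considering only actions whose metadata is commit
  // metadata; compaction/replacecommit instants are filtered out up front.
  static Optional<String> latestCheckpoint(List<TimelineInstant> reverseOrdered, String ckpKey) {
    return reverseOrdered.stream()
        .filter(i -> i.action().equals("commit") || i.action().equals("deltacommit"))
        .map(i -> i.extraMetadata().get(ckpKey))
        .filter(v -> v != null)
        .findFirst();
  }
}
```

Without the action filter, the stream would attempt to read checkpoint metadata from the requested compaction instant first and fail, which mirrors the `HoodieIOException` in the linked Jira.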
[GitHub] [hudi] stream2000 commented on a diff in pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
stream2000 commented on code in PR #9175: URL: https://github.com/apache/hudi/pull/9175#discussion_r1260529173

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
## @@ -362,6 +362,10 @@ private void initClientIds(Configuration conf) {
   private void reset() {
     this.eventBuffer = new WriteMetadataEvent[this.parallelism];
+    if (!this.instant.equals(WriteMetadataEvent.BOOTSTRAP_INSTANT)) {
+      writeClient.getHeartbeatClient().stop(instant);
+      this.ckpMetadata.abortInstant(instant);

Review Comment: Wondering if we need to abort the ckpMetadata...

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] stream2000 opened a new pull request, #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
stream2000 opened a new pull request, #9175: URL: https://github.com/apache/hudi/pull/9175 … before start new instant ### Change Logs When some sub-tasks fail and a global failure is triggered, write tasks are reset and re-deployed. Because `StreamWriteCoordinator` does not extend `RecreateOnResetOperatorCoordinator`, the coordinator won't be closed, so the heartbeat thread of the last instant will remain in the JVM. The whole process is: sub-tasks fail -> global failure triggered -> coordinator resets write tasks -> write tasks recreate and bootstrap -> coordinator handles the bootstrap event -> starts a new instant if no recommit is needed (thus the last instant's heartbeat will remain). ### Impact Stop the current instant's heartbeat before starting a new instant. ### Risk level (write none, low medium or high below) low ### Documentation Update NONE ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6524) Support re-attempting failed compaction
sivabalan narayanan created HUDI-6524: - Summary: Support re-attempting failed compaction Key: HUDI-6524 URL: https://issues.apache.org/jira/browse/HUDI-6524 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: sivabalan narayanan When compaction fails and if user restarts ingestion, we need support to re-attempt compaction w/o doing delta commits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6523) Fix get valid checkpoint for current writer
eric created HUDI-6523: -- Summary: Fix get valid checkpoint for current writer Key: HUDI-6523 URL: https://issues.apache.org/jira/browse/HUDI-6523 Project: Apache Hudi Issue Type: Bug Components: spark Reporter: eric Fix For: 0.14.0 23/07/11 16:50:57 INFO HoodieCompactor: Compactor compacting [CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0001-2bae-40d8-8038-401eefb9e7e3-0_20230711165027926.log.1_1-53-670], dataFileName=Option{val=0001-2bae-40d8-8038-401eefb9e7e3-0_1-27-432_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0001-2bae-40d8-8038-401eefb9e7e3-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3976.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.-eb61-4788-a9cb-aaa67e2e47c4-0_20230711165027926.log.1_0-53-671], dataFileName=Option{val=-eb61-4788-a9cb-aaa67e2e47c4-0_0-27-431_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='-eb61-4788-a9cb-aaa67e2e47c4-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3592.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0002-7160-4515-a0a6-7bcf2e03-0_20230711165027926.log.1_2-53-673], dataFileName=Option{val=0002-7160-4515-a0a6-7bcf2e03-0_2-27-433_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0002-7160-4515-a0a6-7bcf2e03-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3591.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', 
dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0003-5a31-411f-8430-ccf4bec128e8-0_20230711165027926.log.1_3-53-672], dataFileName=Option{val=0003-5a31-411f-8430-ccf4bec128e8-0_3-27-434_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0003-5a31-411f-8430-ccf4bec128e8-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3207.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}] files 23/07/11 16:50:57 ERROR MicroBatchExecution: Query RateStreamSource [id = 44581078-04ee-48ae-bc74-143b3c836a23, runId = 916ce5a4-bd8a-4010-8c0e-869c58db41ab] terminated with error org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230711165055791__compaction__REQUESTED] at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$1(CommitUtils.java:173) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:356) at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:500) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) at org.apache.hudi.common.util.CommitUtils.getValidCheckpointForCurrentWriter(CommitUtils.java:175) at org.apache.hudi.HoodieStreamingSink.canSkipBatch(HoodieStreamingSink.scala:313) at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:104) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:665) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:663) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375) at org
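The stack trace above shows `getValidCheckpointForCurrentWriter` throwing while trying to parse commit metadata for a compaction instant that is still in REQUESTED state (`[==>20230711165055791__compaction__REQUESTED]`), which has no metadata to parse yet. A minimal stdlib-only sketch of the fix idea — skip non-completed instants before attempting to parse — using a hypothetical `"<timestamp>__<action>__<state>"` string encoding (the real Hudi timeline API differs):

```java
import java.util.List;
import java.util.Optional;

public class CheckpointScanSketch {
    // Hypothetical instant encoding: "<timestamp>__<action>__<state>".
    static Optional<String> latestParseableInstant(List<String> timeline) {
        return timeline.stream()
                // Requested/inflight instants (e.g. a pending compaction) have no
                // commit metadata yet, so parsing them would throw; skip them first.
                .filter(i -> i.endsWith("__COMPLETED"))
                .reduce((a, b) -> b); // keep the newest completed instant
    }

    public static void main(String[] args) {
        List<String> timeline = List.of(
                "20230711165027926__deltacommit__COMPLETED",
                "20230711165055791__compaction__REQUESTED");
        System.out.println(latestParseableInstant(timeline).orElse("none"));
    }
}
```

This is only an illustration of the guard; the actual fix belongs in `CommitUtils.getValidCheckpointForCurrentWriter`, which streams real `HoodieInstant` objects.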
[GitHub] [hudi] nylqd commented on issue #9162: [SUPPORT] Permission denied access ckp_meta while start flink job
nylqd commented on issue #9162: URL: https://github.com/apache/hudi/issues/9162#issuecomment-1631793719 > @nylqd Can you please share the permissions for all directories under: > > ``` > /path_to_hudi_table/.hoodie/.aux/ckp_meta/. > ``` > > When restarting a job, under normal conditions, the `ckp_meta` is deleted and recreated again. > > The code (from the master branch; iiuc, the code for this section hasn't changed much since 0.11) for it is shown below: > > https://github.com/apache/hudi/assets/6312314/bbf1884c-7daa-45e7-9624-e292c529c9e9 > > This is how mkdirs is implemented: > > https://github.com/apache/hudi/assets/6312314/20772086-7a6c-40b5-a812-bd1d7d4fa1be > > Of which the default permission of 777 is used: > https://github.com/apache/hudi/assets/6312314/c4204d06-1ca6-4c86-b9b8-7419f87d8968 This is the aux dir while the job is running: ![image](https://github.com/apache/hudi/assets/3632490/63108195-e046-4a69-828a-6588aa862095) And after stopping the job: ![image](https://github.com/apache/hudi/assets/3632490/2ec6321d-5cf3-4141-8c48-ddf34995e79a) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
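The exchange above describes `ckp_meta` being deleted and recreated on restart with default 777 permissions. A small stand-alone sketch of that behavior on the local filesystem (the path is a hypothetical stand-in for `<table_path>/.hoodie/.aux/ckp_meta`, and this does not use Hudi's HDFS-based `mkdirs`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class CkpMetaPermsSketch {
    // Create a local stand-in for the ckp_meta dir with the wide-open 777
    // permissions the thread says mkdirs applies by default.
    static Path createWideOpenDir() throws IOException {
        Path ckpMeta = Files.createTempDirectory("ckp_meta_demo");
        Files.setPosixFilePermissions(ckpMeta, PosixFilePermissions.fromString("rwxrwxrwx"));
        return ckpMeta;
    }

    public static void main(String[] args) throws IOException {
        Path dir = createWideOpenDir();
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(dir)));
    }
}
```

If the directory shows anything more restrictive than `rwxrwxrwx` (as in the screenshots), the restarting job's user may be unable to recreate or write under it, which would explain the permission-denied error.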
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260517417 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream<HoodieInstant> getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option<HoodieInstant> lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream<HoodieInstant> getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To find out which instants are conflicting, we apply the following logic: +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List<HoodieInstant> completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic: + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write. 
+ */ + private Stream<HoodieInstant> getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream<HoodieInstant> completedCommitsStream = +activeT
[GitHub] [hudi] hudi-bot commented on pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
hudi-bot commented on PR #9158: URL: https://github.com/apache/hudi/pull/9158#issuecomment-1631790941 ## CI report: * cfc0f3271163b1d5cc16b4766210b67c61ecacf6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18488) * 5a26a9ab4d8ccd2f9d637c3a79517d44dc75cbbc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18510) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631790650 ## CI report: * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509) * 66acd28baa87819b949d165775556169243ccc3f UNKNOWN
[GitHub] [hudi] ChiaWeiGithub opened a new issue, #9174: [SUPPORT]flink sql hudi will create duplicate record on same s3 path when the data size is bigger enough to create new filegroup and different
ChiaWeiGithub opened a new issue, #9174: URL: https://github.com/apache/hudi/issues/9174 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** Flink SQL with Hudi will create duplicate records on the same S3 path when the data size is big enough to create a new filegroup and a different Flink session runs under a different YARN application. **To Reproduce** Steps to reproduce the behavior: 1. Download e-commerce data from https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store 2. Upload it to an S3 bucket 3. Run the query below from Athena, modifying the S3 path: ` CREATE EXTERNAL TABLE `ecommerce`( `event_time` string, `event_type` string, `product_id` bigint, `category_id` bigint, `category_code` string, `brand` string, `price` double, `user_id` bigint, `user_session` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3:///opendata/ecommerce/' CREATE TABLE ecommercedistinct WITH ( format = 'Parquet', write_compression = 'SNAPPY' ) AS SELECT sum(price) as price, user_session, CAST(NULL AS INTEGER) AS empty_column FROM "your_database"."ecommerce" where user_session IS NOT NULL group by user_session; ` 4. Since the data still has null values in user_session, which will break the query later on, find the null user sessions: select "$path", user_session from "ecommercedistinct" order by user_session ASC; 5. Randomly pick 5 files under the S3 path of table ecommercedistinct, avoiding the file with the null user_session 6. 
Launch an EMR cluster, primary: emr-6.7.0 with Flink, m5.2xlarge; 5 core nodes with m5.2xlarge. 7. Create a Flink session: `export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` sudo /usr/lib/flink/bin/start-cluster.sh flink-yarn-session -d -jm 12288 -tm 12288 /usr/lib/flink/bin/sql-client.sh embedded -e /home/hadoop/sql-env.yaml flink sql> add jar '/usr/lib/hudi/hudi-flink1.14-bundle_2.12-0.11.0-amzn-0.jar'; # create source table. Please modify the s3 path to your s3 path with the data in partaccesslog.tar.gz CREATE TABLE tablesource1 ( `price` double, `user_session` varchar(255) PRIMARY KEY, `empty_column` int ) with ('connector'='filesystem', 'path'= 's3:///tables/5file/','format'='parquet'); # create target table, set 'hoodie.copyonwrite.insert.split.size'='500' to force creation of a bigger filegroup file; please modify the s3 path accordingly. CREATE TABLE tabletarget1 ( `price` double, `user_session` varchar(255) PRIMARY KEY, `empty_column` int ) with ('connector'='hudi', 'path'= 's3:///ecommercetarget1/', 'table.type' = 'COPY_ON_WRITE', 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload', 'write.payload.class'= 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload', 'write.batch-size'='32', 'hoodie.copyonwrite.insert.split.size'='500', 'table.exec.sink.not-null-enforcer'='drop'); INSERT INTO tabletarget1 select * from tablesource1; <== takes about 6+ minutes to finish and create a file ` 8. To make a duplicate record: stop the SQL session, kill the YARN application, create a new one, start a new Flink SQL session, randomly choose a user_session id and insert: ` INSERT INTO tabletarget1 (user_session,empty_column) VALUES ('0bc5bd30-4b48-4637-b5f0-9e327707d2bc',1); ` 9. 
Go to Amazon Athena to check; we can see two rows with user_session='94CFQJ4T7409AW2P': ` CREATE EXTERNAL TABLE `tabletarget`( `price` double, `user_session` string, `empty_column` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3:///ecommercetarget1/'; select "$path","user_session" from tabletarget where user_session='94CFQJ4T7409AW2P'; ` **Expected behavior** The record is not duplicated when the data size is small. For example, if I create an empty table and insert values, it is always deduplicated. No matter I use a new fli
[GitHub] [hudi] hudi-bot commented on pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
hudi-bot commented on PR #9158: URL: https://github.com/apache/hudi/pull/9158#issuecomment-1631785976 ## CI report: * cfc0f3271163b1d5cc16b4766210b67c61ecacf6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18488) * 5a26a9ab4d8ccd2f9d637c3a79517d44dc75cbbc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631780663 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631780464 ## CI report: * 5f1f42b93a005a10f308559091bd1f329a7318ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18508) * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
danny0405 commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260505794 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: > Another way, I can new a ArrayList to addAll items, just like: This is always better.
[GitHub] [hudi] danny0405 commented on pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
danny0405 commented on PR #9160: URL: https://github.com/apache/hudi/pull/9160#issuecomment-1631772609 cc @bvaradar, can you take a look at this change? I'm not very clear about its impact on the Spark engine/workflow.
[GitHub] [hudi] KnightChess commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
KnightChess commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260502975 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: This fixes the error below: the immutable empty list cannot accept new items. ```shell java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) ``` Alternatively, I can create a new ArrayList and addAll the items into it, like this: ```java if (config.areReleaseResourceEnabled()) { HoodieSparkEngineContext sparkEngineContext = (HoodieSparkEngineContext) context; Map<Integer, JavaRDD<?>> allCachedRdds = sparkEngineContext.getJavaSparkContext().getPersistentRDDs(); List<Integer> allDataIds = new ArrayList<>(sparkEngineContext.removeCachedDataIds(HoodieDataCacheKey.of(basePath, instantTime))); if (config.isMetadataTableEnabled()) { String metadataTableBasePath = HoodieTableMetadata.getMetadataTableBasePath(basePath); allDataIds.addAll(sparkEngineContext.removeCachedDataIds(HoodieDataCacheKey.of(metadataTableBasePath, instantTime))); } for (int id : allDataIds) { if (allCachedRdds.containsKey(id)) { allCachedRdds.get(id).unpersist(); } } } ```
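The review exchange above turns on why `Collections.emptyList()` is unsafe as a `getOrDefault` fallback when callers mutate the result: it returns the shared immutable empty-list singleton, and any `add` throws the `UnsupportedOperationException` quoted in the comment. A minimal stdlib-only repro:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmptyListPitfall {
    public static void main(String[] args) {
        Map<String, List<Integer>> cache = new HashMap<>();

        // Fallback is the shared immutable empty list: mutating it throws.
        List<Integer> bad = cache.getOrDefault("missing", Collections.emptyList());
        try {
            bad.add(1);
        } catch (UnsupportedOperationException e) {
            System.out.println("immutable fallback: " + e.getClass().getSimpleName());
        }

        // A fresh ArrayList (or copying entries into one, as the comment
        // suggests) is mutable and safe to return to callers.
        List<Integer> good = new ArrayList<>(cache.getOrDefault("missing", Collections.emptyList()));
        good.add(1);
        System.out.println(good);
    }
}
```

Copying into a new `ArrayList` (the "addAll" variant) also avoids handing callers a reference to internal cached state, which is why the reviewer calls it "always better".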
[GitHub] [hudi] danny0405 commented on a diff in pull request #9165: [HUDI-6517] Throw error if deletion of invalid data file fails
danny0405 commented on code in PR #9165: URL: https://github.com/apache/hudi/pull/9165#discussion_r1260496231 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java: ## @@ -688,7 +688,10 @@ private void deleteInvalidFilesByPartitions(HoodieEngineContext context, Map
[GitHub] [hudi] danny0405 commented on issue #9132: [SUPPORT] hudi deltastreamer jsonkafka source schema registry fail
danny0405 commented on issue #9132: URL: https://github.com/apache/hudi/issues/9132#issuecomment-1631762925 > Exception in thread "main" org.apache.avro.SchemaParseException: Type not supported: object at org.apache.avro.Schema.parse(Schema.java:1734) at org.apache.avro.Schema$Parser.parse(Schema.java:1430) It is fairly apparent that the schema fetched from the provider is malformed: Avro cannot parse it correctly. I'm not sure whether this is related to the Avro version. What version of Avro did you use when generating the schema string?
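The error `Type not supported: object` usually indicates that a JSON-Schema-style document (`"type": "object"`) was handed to Avro's `Schema.Parser`, which only understands Avro types such as `record`. For comparison, a shape Avro's parser does accept (record and field names here are hypothetical, not from this issue):

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": ["null", "long"], "default": null}
  ]
}
```

If the schema registry serves JSON Schema rather than Avro schemas, the Deltastreamer schema provider would fail exactly this way regardless of the Avro version.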
[GitHub] [hudi] danny0405 commented on issue #9172: [SUPPORT] java.lang.ClassCastException with incremental query
danny0405 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1631754480 It seems vectorized execution is enabled while Hudi cannot handle it here; this is related to the Spark version. Can you use Spark 3.2.x and try again? This will be fixed in 0.14.0.
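Until upgrading, a commonly used mitigation for vectorized-reader incompatibilities with Hudi incremental queries (an assumption here, not confirmed in this thread) is to disable Spark's vectorized Parquet reader so the row-based read path is used:

```properties
# Disable Spark's vectorized Parquet reader for the session running the
# incremental query; this trades some scan performance for compatibility.
spark.sql.parquet.enableVectorizedReader=false
```

This can be set in `spark-defaults.conf`, via `--conf` on `spark-submit`, or with `spark.conf.set(...)` before running the query.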
[GitHub] [hudi] danny0405 commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
danny0405 commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260483097 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: why this change?
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631750583 ## CI report: * 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507) * 5f1f42b93a005a10f308559091bd1f329a7318ff Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18508) * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e UNKNOWN
[GitHub] [hudi] danny0405 commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
danny0405 commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631750486 > then a wrong setting here would probably keep only a single HFile Can we add some validation logic in the metadata table write config builder to guard correctness? Note that keeping at least 2 versions for each file group will also double the storage used by the metadata table.
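The validation danny0405 suggests for the metadata table write config builder could look roughly like the sketch below (the method, class, and error message are hypothetical; `hoodie.cleaner.fileversions.retained` is the cleaner's retained-versions config key):

```java
public class MdtCleanerConfigGuard {
    // Hypothetical guard: the metadata table must keep at least 2 file versions
    // per file group so a concurrent reader never sees its only HFile deleted
    // while a clean is in progress.
    static int validateRetainedFileVersions(int retained) {
        if (retained < 2) {
            throw new IllegalArgumentException(
                "Metadata table requires hoodie.cleaner.fileversions.retained >= 2, got " + retained);
        }
        return retained;
    }

    public static void main(String[] args) {
        System.out.println(validateRetainedFileVersions(2));
    }
}
```

Failing fast at config-build time, rather than discovering a missing HFile at read time, is the point of the suggestion; the storage-doubling trade-off noted above still applies.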
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631745561 ## CI report: * b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504) * 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507) * 5f1f42b93a005a10f308559091bd1f329a7318ff UNKNOWN
[GitHub] [hudi] stream2000 commented on pull request #9169: [HUDI-6521] Disable failing test case.
stream2000 commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631745271 Hi @amrishlal and @nsivabalan, sorry for the test case failure I introduced. I have raised #9163, which fixes CI. Should we just revert this PR and enable the test again? Thanks~
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631739180 ## CI report: * 55648de1a087b2081dde0ef11ef1eb6944e18e0c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18501)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631712005

## CI report:

* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
* 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507)
[GitHub] [hudi] soumilshah1995 commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource
soumilshah1995 commented on issue #9143: URL: https://github.com/apache/hudi/issues/9143#issuecomment-1631710874

# Project: Using Apache Hudi DeltaStreamer and AWS DMS Hands-on Labs

![image](https://user-images.githubusercontent.com/39345855/228927370-f7264d4a-f026-4014-9df4-b063f000f377.png)

Video Tutorials

* Part 1: Project Overview: https://www.youtube.com/watch?v=D9L0NLSqC1s
* Part 2: Aurora Setup: https://youtu.be/HR5A6iGb4LE
* Part 3: https://youtu.be/rnyj5gkIPKA
* Part 4: https://youtu.be/J1xvPIcDIaQ
* Part 5: https://youtu.be/On-6L6oJ158
* How to set up an EMR cluster with VPC: https://www.youtube.com/watch?v=-e1-Zsk17Ss&t=4s
* PDF guide: https://drive.google.com/file/d/1Hj_gyZ8o-wFf4tqTYZXOHNJz2f3ARepn/view

# Steps

### Step 1: Create the Aurora source database and update the settings to enable CDC on Postgres

* Read more: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
* Create a new parameter group and make sure to edit these two settings as shown in video part 2:

```
rds.logical_replication 1
wal_sender_timeout 30
```

Once done, apply these to the database and reboot it.

### Step 2: Run the Python file to create a table called sales in the public schema in Aurora and populate some data into the table

Run `python run.py`

```python
try:
    import os
    import logging
    from functools import wraps
    from abc import ABC, abstractmethod
    from enum import Enum
    from logging import StreamHandler
    import uuid
    from datetime import datetime, timezone
    from random import randint
    import datetime
    import sqlalchemy as db
    from faker import Faker
    import random
    import psycopg2
    import psycopg2.extras as extras
    from dotenv import load_dotenv

    load_dotenv(".env")
except Exception as e:
    raise Exception("Error: {} ".format(e))


class Logging:
    """This class is used for logging data to Datadog and to the console."""

    def __init__(self, service_name, ddsource, logger_name="demoapp"):
        self.service_name = service_name
        self.ddsource = ddsource
        self.logger_name = logger_name
        format = "[%(asctime)s] %(name)s %(levelname)s %(message)s"
        self.logger = logging.getLogger(self.logger_name)
        formatter = logging.Formatter(format, )
        if logging.getLogger().hasHandlers():
            logging.getLogger().setLevel(logging.INFO)
        else:
            logging.basicConfig(level=logging.INFO)


global logger
logger = Logging(service_name="database-common-module",
                 ddsource="database-common-module",
                 logger_name="database-common-module")


def error_handling_with_logging(argument=None):
    def real_decorator(function):
        @wraps(function)
        def wrapper(self, *args, **kwargs):
            function_name = function.__name__
            response = None
            try:
                if kwargs == {}:
                    response = function(self)
                else:
                    response = function(self, **kwargs)
            except Exception as e:
                response = {
                    "status": -1,
                    "error": {"message": str(e), "function_name": function.__name__},
                }
                logger.logger.info(response)
            return response

        return wrapper

    return real_decorator


class DatabaseInterface(ABC):
    @abstractmethod
    def get_data(self, query):
        """
        For a given query fetch the data
        :param query: Str
        :return: Dict
        """

    def execute_non_query(self, query):
        """
        Inserts data into SQL Server
        :param query: Str
        :return: Dict
        """

    def insert_many(self, query, data):
        """
        Insert many items into the database
        :param query: str
        :param data: tuple
        :return: Dict
        """

    def get_data_batch(self, batch_size=10, query=""):
        """
        Gets data in batches
        :param batch_size: INT
        :param query: STR
        :return: DICT
        """

    def get_table(self, table_name=""):
        """
        Gets the table from the database
        :param table_name: STR
        :return: OBJECT
        """


class Settings(object):
    """settings class"""

    def __init__(
            self,
            port="",
```
[GitHub] [hudi] soumilshah1995 commented on issue #9170: [SUPPORT] Possibly unquoted identifier in Glue V4 Streaming Job
soumilshah1995 commented on issue #9170: URL: https://github.com/apache/hudi/issues/9170#issuecomment-1631709692

Can you try this code?

```python
try:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.sql.session import SparkSession
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.sql import DataFrame, Row
    from pyspark.sql.functions import *
    from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
    import datetime
    from awsglue import DynamicFrame
    import pyspark.sql.functions as F
    import boto3
except Exception as e:
    print("ERROR IMPORTS ", e)

"""
{
  "version": "0",
  "id": "6b437ef5-9686-f7c7-b7dd-a2438bf5c92c",
  "detail-type": "order",
  "source": "order",
  "account": "043916019468",
  "time": "2023-02-12T10:42:48Z",
  "region": "us-west-1",
  "resources": [],
  "detail": {
    "orderid": "a59a6e61-ebb3-4442--53d9654ab3eb",
    "customer_id": "71d03aa7-5179-4c62-ab7d-e9cbf981d889",
    "ts": "2023-02-12T10:42:48.264802",
    "order_value": "691",
    "priority": "MEDIUM"
  }
}

--- INPUT ---
root
 |-- account: string (nullable = true)
 |-- detail: struct (nullable = true)
 |    |-- customer_id: string (nullable = true)
 |    |-- order_value: string (nullable = true)
 |    |-- orderid: string (nullable = true)
 |    |-- priority: string (nullable = true)
 |    |-- ts: string (nullable = true)
 |-- detail-type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- resources: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- source: string (nullable = true)
 |-- time: string (nullable = true)
 |-- version: string (nullable = true)

--- OUTPUT ---
root
 |-- account: string (nullable = true)
 |-- detail_type: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- source: string (nullable = true)
 |-- time: string (nullable = true)
 |-- version: string (nullable = true)
 |-- detail_customer_id: string (nullable = true)
 |-- detail_order_value: string (nullable = true)
 |-- detail_orderid: string (nullable = true)
 |-- detail_priority: string (nullable = true)
 |-- detail_ts: string (nullable = true)
"""


def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .getOrCreate()
    return spark


spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)

# == Settings
db_name = "hudidb"
kinesis_table_name = 'kinesis_order_events'
table_name = "order"
record_key = 'detail_orderid'
precomb = 'detail_ts'
s3_bucket = 'sX'
s3_path_hudi = f's3a://{s3_bucket}/{table_name}/'
s3_path_spark = f's3://{s3_bucket}/spark_checkpoints/'
method = 'upsert'
table_type = "MERGE_ON_READ"
window_size = '10 seconds'
starting_position_of_kinesis_iterator = 'trim_horizon'

connection_options = {
    'hoodie.table.name': table_name,
    "hoodie.datasource.write.storage.type": table_type,
    'hoodie.datasource.write.recordkey.field': record_key,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
    'hoodie.datasource.write.precombine.field': precomb,
    'hoodie.datasource.hive_sync.enable': 'true',
    "hoodie.datasource.hive_sync.mode": "hms",
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': db_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
}

# ==
starting_position_of_kinesis_iterator = starting_position_of_kinesis_iterator

data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=db_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true",
                        "startingPosition": starting_position_of_kinesis_iterator}
)


def flatten_df(nested_df):
    flat_cols = [c[0] for
```
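The message above is cut off inside `flatten_df`. The usual shape of that helper — hoisting nested struct fields to top-level `parent_child` column names, which is exactly what the OUTPUT schema in the docstring shows — can be illustrated on plain dictionaries (a sketch of the idea only, not the original PySpark code):

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Hoist nested dicts to top-level keys joined with '_',
    mirroring what flatten_df does for Spark struct columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}".replace("-", "_")  # e.g. 'detail-type' -> 'detail_type'
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

event = {"detail-type": "order", "detail": {"orderid": "a59", "ts": "2023-02-12"}}
print(flatten(event))
# {'detail_type': 'order', 'detail_orderid': 'a59', 'detail_ts': '2023-02-12'}
```

In the PySpark version the same renaming is typically done with `col(...).alias(...)` selects built from `nested_df.dtypes`.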
[GitHub] [hudi] soumilshah1995 commented on issue #9172: [SUPPORT] java.lang.ClassCastException with incremental query
soumilshah1995 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1631708363

can you share some code or other details?
[GitHub] [hudi] leesf commented on a diff in pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
leesf commented on code in PR #9160: URL: https://github.com/apache/hudi/pull/9160#discussion_r1260443951

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:

```diff
@@ -185,36 +179,55 @@ public void start(String instantTime) {
   }

   /**
-   * Stops the heartbeat for the specified instant.
-   * @param instantTime
+   * Stops the heartbeat and deletes the heartbeat file for the specified instant.
+   *
+   * @param instantTime The instant time for the heartbeat.
    * @throws HoodieException
    */
   public void stop(String instantTime) throws HoodieException {
     Heartbeat heartbeat = instantToHeartbeatMap.get(instantTime);
-    if (heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped()) {
-      LOG.info("Stopping heartbeat for instant " + instantTime);
-      heartbeat.getTimer().cancel();
-      heartbeat.setHeartbeatStopped(true);
-      LOG.info("Stopped heartbeat for instant " + instantTime);
+    if (isHeartbeatStarted(heartbeat)) {
+      stopHeartbeatTimer(heartbeat);
       HeartbeatUtils.deleteHeartbeatFile(fs, basePath, instantTime);
       LOG.info("Deleted heartbeat file for instant " + instantTime);
     }
   }

   /**
-   * Stops all heartbeats started via this instance of the client.
+   * Stops all timers of heartbeats started via this instance of the client.
+   *
    * @throws HoodieException
    */
-  public void stop() throws HoodieException {
-    instantToHeartbeatMap.values().forEach(heartbeat -> stop(heartbeat.getInstantTime()));
+  public void stopHeartbeatTimers() throws HoodieException {
+    instantToHeartbeatMap.values().stream().filter(this::isHeartbeatStarted).forEach(this::stopHeartbeatTimer);
+  }
+
+  /**
+   * Whether the given heartbeat is started.
+   *
+   * @param heartbeat The heartbeat to check whether is started.
+   * @return Whether the heartbeat is started.
+   * @throws IOException
+   */
+  private boolean isHeartbeatStarted(Heartbeat heartbeat) {
+    return heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped();
```

Review Comment: would `heartbeat != null && heartbeat.isHeartbeatStarted()` or `heartbeat != null && heartbeat.isHeartbeatStopped()` be enough?
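The reviewer's question can be settled by enumerating the boolean states (a toy model, not Hudi's classes; the `heartbeat != null` guard is common to every variant, and the second suggestion is read here as `!isHeartbeatStopped()`):

```python
from itertools import product

def full(started, stopped):              # the PR's check
    return started and not stopped

def only_started(started, stopped):      # first suggested simplification
    return started

def only_not_stopped(started, stopped):  # second suggested simplification
    return not stopped

# Print the truth table for all four (started, stopped) combinations.
for started, stopped in product([True, False], repeat=2):
    print(started, stopped, full(started, stopped),
          only_started(started, stopped), only_not_stopped(started, stopped))
```

The shorter checks disagree with the full one on (started=True, stopped=True) and (started=False, stopped=False) respectively, so dropping a clause is only safe if those states are unreachable in practice.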
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631706730

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
* 4360020cdb803dff6dc4e21bab9745ec287956f2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631673249

## CI report:

* 10f9adc8c62d6fae7219a4472cba010a1c1c0da0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18472)
* 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18506)
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631673194

## CI report:

* e0622a6d0f2f7294ed079dd42cf5ff65a8718da3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18478)
* c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18505)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631672847

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631667354

## CI report:

* 10f9adc8c62d6fae7219a4472cba010a1c1c0da0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18472)
* 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631667269

## CI report:

* e0622a6d0f2f7294ed079dd42cf5ff65a8718da3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18478)
* c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631666998

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631661084

## CI report:

* 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18502)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9165: [HUDI-6517] Throw error if deletion of invalid data file fails
nsivabalan commented on code in PR #9165: URL: https://github.com/apache/hudi/pull/9165#discussion_r1260403670

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:

@@ -688,7 +688,10 @@ private void deleteInvalidFilesByPartitions(HoodieEngineContext context, Map
[hudi] branch master updated: [HUDI-6521] Disable failing test case. (#9169)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:

new 51ddf1affcd [HUDI-6521] Disable failing test case. (#9169)

51ddf1affcd is described below

commit 51ddf1affcdead2e3b5e871ba4816c71e6f4b99a
Author: Amrish Lal
AuthorDate: Tue Jul 11 16:51:29 2023 -0700

[HUDI-6521] Disable failing test case. (#9169)

---
.../src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

```diff
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
index 7cd90145507..9508da7eaa2 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
@@ -1191,7 +1191,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
     }
   }

-  test("Test Bulk Insert Into Bucket Index Table") {
+  /** Ignore failing test case (see HUDI-6521 for more details) */
+  ignore("Test Bulk Insert Into Bucket Index Table") {
     withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert", "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
       Seq("mor", "cow").foreach { tableType =>
         Seq("true", "false").foreach { bulkInsertAsRow =>
```
[GitHub] [hudi] nsivabalan merged pull request #9169: [HUDI-6521] Disable failing test case.
nsivabalan merged PR #9169: URL: https://github.com/apache/hudi/pull/9169
[GitHub] [hudi] stream2000 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
stream2000 commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1631644198

> The test case modified in this PR is failing.

Sorry for the inconvenience, we have fixed the CI in #9163; could you please rebase onto the latest master?
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260391602

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:

```diff
@@ -217,6 +218,8 @@ public List close() {
     setupWriteStatus();

+    // createCompleteMarkerFile throws hoodieException, if marker directory is not present.
+    createCompletedMarkerFile(partitionPath, this.instantTime);
```

Review Comment: Done
[GitHub] [hudi] amrishlal commented on pull request #9169: [HUDI-6521] Disable failing test case.
amrishlal commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631635510

@hudi-bot run azure
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260389155

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

```diff
@@ -135,18 +155,42 @@ private String translateMarkerToDataPath(String markerPath) {
     return stripMarkerSuffix(rPath);
   }

+  public static String stripMarkerSuffix(String path) {
+    return path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN));
+  }
+
+  public static String stripOldStyleMarkerSuffix(String path) {
+    // marker file was created by older version of Hudi, with INPROGRESS_MARKER_EXTN (f1_w1_c1.marker).
+    // Rename to data file by replacing .marker with .parquet.
+    return String.format("%s%s", path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)),
+        HoodieFileFormat.PARQUET.getFileExtension());
+  }
+
   @Override
   public Set allMarkerFilePaths() throws IOException {
     Set markerFiles = new HashSet<>();
     if (doesMarkerDirExist()) {
       FSUtils.processFiles(fs, markerDirPath.toString(), fileStatus -> {
-        markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        // Only the inprogres markerFiles are to be included here
+        if (fileStatus.getPath().toString().contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)) {
+          markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        }
         return true;
       }, false);
     }
     return markerFiles;
   }

+  public boolean markerExists(Path markerPath) {
+    boolean objExists;
```

Review Comment: Done
[GitHub] [hudi] nbalajee commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-1631633479

> With respect to the partial files, when the feature flag is enabled, it will reduce the possibility of partial data files getting created.

Since second and subsequent attempts, in the presence of a completed marker, would return success (with the previously completed writeStatus), we are minimizing multiple tasks attempting to create files. Deleting the .hoodie/.temp/ marker folder after the finalizeWrite ensures that any new task attempting to create a data file (which would show up as a partially written file when the Job/JVM exits) will fail at the time of writeHandle creation if the marker folder is absent, and not create a partial file.
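The gating described here can be modeled in a few lines (a toy sketch with hypothetical names, not Hudi's marker API): writes are only allowed while the marker directory exists, so deleting it during finalize turns any straggler task into a failure instead of a partial file.

```python
import os
import tempfile

# Toy model of the described behavior: a handle may only create a data file
# while the marker directory exists; finalize deletes the marker dir, so any
# straggler task fails up front instead of leaving a partial data file behind.

def create_data_file(marker_dir: str, data_path: str) -> bool:
    if not os.path.isdir(marker_dir):
        raise RuntimeError("marker directory absent; write already finalized")
    with open(data_path, "w") as f:
        f.write("data")
    return True

base = tempfile.mkdtemp()
marker_dir = os.path.join(base, ".hoodie", ".temp", "commit1")
os.makedirs(marker_dir)

create_data_file(marker_dir, os.path.join(base, "f1.parquet"))  # succeeds
os.rmdir(marker_dir)  # finalize: marker folder removed
try:
    create_data_file(marker_dir, os.path.join(base, "f2.parquet"))
except RuntimeError as e:
    print("straggler blocked:", e)
```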
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631627162

## CI report:

* 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9169: [HUDI-6521] Disable failing test case.
hudi-bot commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631627132

## CI report:

* 7ccae5025f90656628abc0b75d452d4b3ad1dab6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18496)
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631627090

## CI report:

* 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499)
* 55648de1a087b2081dde0ef11ef1eb6944e18e0c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18501)
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631626905

## CI report:

* eb56e1be9ea831362a61adccec2ec2826c86d6a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18240)
* 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500)
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260376601

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/TimelineServerBasedWriteMarkers.java:

```diff
@@ -86,6 +89,17 @@ public TimelineServerBasedWriteMarkers(HoodieTable table, String instantTime) {
     this.timeoutSecs = timeoutSecs;
   }

+  @Override
+  protected Path getMarkerPath(String partitionPath, String dataFileName, IOType type) {
+    return new Path(partitionPath, getMarkerFileName(dataFileName, type));
+  }
```

Review Comment: Removed.
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260376138

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -512,6 +512,7 @@ public List close() {
       status.getStat().setFileSizeInBytes(logFileSize);
     }

+    createCompletedMarkerFile(partitionPath, baseInstantTime);

Review Comment: Done. Added the check and changed the function to createCompletedMarkerFileIfRequired().
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260374587

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

@@ -135,18 +155,42 @@ private String translateMarkerToDataPath(String markerPath) {
     return stripMarkerSuffix(rPath);
   }

+  public static String stripMarkerSuffix(String path) {
+    return path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN));
+  }
+
+  public static String stripOldStyleMarkerSuffix(String path) {
+    // marker file was created by an older version of Hudi, with INPROGRESS_MARKER_EXTN (f1_w1_c1.marker).
+    // Rename to data file by replacing .marker with .parquet.
+    return String.format("%s%s", path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)),
+        HoodieFileFormat.PARQUET.getFileExtension());
+  }
+
   @Override
   public Set allMarkerFilePaths() throws IOException {
     Set markerFiles = new HashSet<>();
     if (doesMarkerDirExist()) {
       FSUtils.processFiles(fs, markerDirPath.toString(), fileStatus -> {
-        markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        // Only the in-progress marker files are to be included here
+        if (fileStatus.getPath().toString().contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)) {

Review Comment: explained above.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

@@ -119,7 +139,7 @@ public Set createdAndMergedDataPaths(HoodieEngineContext context, int pa
       while (itr.hasNext()) {
         FileStatus status = itr.next();
         String pathStr = status.getPath().toString();
-        if (pathStr.contains(HoodieTableMetaClient.MARKER_EXTN) && !pathStr.endsWith(IOType.APPEND.name())) {
+        if (pathStr.contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN) && !pathStr.endsWith(IOType.APPEND.name())) {
           result.add(translateMarkerToDataPath(pathStr));

Review Comment:
1. Every attempt to create a data file will have a corresponding in-progress marker file.
2. A successful attempt at creating a data file creates a completion marker (if the flag is enabled).
3. If a second or subsequent attempt is made to recreate the file and a completion marker exists when trying to create the write handle, then, if the flag is enabled, the old write status is returned and the attempt is considered successful.
4. While an ongoing write is in progress (file write not completed yet), if a second attempt is started, a new in-progress marker is created for the second/subsequent attempt. At the time of closing the write handle/file, the first process creates the completion marker. Second/subsequent attempts trying to close the write handle clean up their data file upon seeing the presence of a completion marker.
5. When reconciling data files, all in-progress markers are read and the list is pruned by removing entries that have a write status. Data files that have been created but don't have a corresponding write status are candidates for deletion during finalizeWrite.

During step 5, only in-progress markers are considered (completed markers are not, as the write status is used as the source of truth).
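As a hedged illustration of the reconciliation step described above, the pruning reduces to a set difference between the data paths implied by in-progress markers and the paths acknowledged in write statuses. Class and method names here are hypothetical, not Hudi's actual marker/WriteStatus APIs:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the final reconciliation: data paths implied by in-progress
// markers are pruned by the paths acknowledged in write statuses; whatever
// remains was created but never reported, and is a deletion candidate
// during finalizeWrite. Names are illustrative only.
public class MarkerReconciliationSketch {

  public static Set<String> orphanDataFiles(Set<String> inProgressMarkerDataPaths,
                                            Set<String> writeStatusDataPaths) {
    Set<String> candidates = new HashSet<>(inProgressMarkerDataPaths);
    // Write statuses are the source of truth: acknowledged files are kept.
    candidates.removeAll(writeStatusDataPaths);
    return candidates;
  }

  public static void main(String[] args) {
    Set<String> markers = Set.of("p1/f1.parquet", "p1/f2.parquet");
    Set<String> statuses = Set.of("p1/f1.parquet");
    // f2 was attempted but never acknowledged, so it is the orphan.
    System.out.println(orphanDataFiles(markers, statuses));
  }
}
```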
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:

@@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) {
    *
    * @param partitionPath Partition path
    */
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-    WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-        .create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) {
+    WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+    if (!writeMarkers.doesMarkerDirExist()) {
+      throw new HoodieIOException(String.format("Marker root directory absent : %s/%s (%s)",
+          partitionPath, dataFileName, markerInstantTime));
+    }
+    if (config.enforceFinalizeWriteCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", "FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) {

Review Comment:
> In general, instead of an exact match of the state, I prefer the final consistency of the files diff we have on master.

+1 on performing final consistency of files based on the write statuses; this is our preferred approach. However, this change is trying to address extreme corner cases that could result in extra files left in the dataset's partition folder, that might appea
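The suffix-stripping helpers quoted earlier in this review can be exercised in isolation. The extension constants below are inlined assumptions for illustration; in Hudi they live on HoodieTableMetaClient and HoodieFileFormat:

```java
// Standalone sketch of the DirectWriteMarkers suffix helpers discussed above.
// The extension string values are assumptions inlined for illustration.
public class MarkerSuffixSketch {
  static final String INPROGRESS_MARKER_EXTN = ".marker"; // assumed value
  static final String PARQUET_EXTN = ".parquet";          // assumed value

  static String stripMarkerSuffix(String path) {
    // Drop everything from the marker extension onward.
    return path.substring(0, path.indexOf(INPROGRESS_MARKER_EXTN));
  }

  static String stripOldStyleMarkerSuffix(String path) {
    // Old-style markers (f1_w1_c1.marker) map to the data file by
    // replacing .marker with .parquet.
    return stripMarkerSuffix(path) + PARQUET_EXTN;
  }

  public static void main(String[] args) {
    System.out.println(stripMarkerSuffix("f1_w1_c1.parquet.marker.CREATE")); // f1_w1_c1.parquet
    System.out.println(stripOldStyleMarkerSuffix("f1_w1_c1.marker"));        // f1_w1_c1.parquet
  }
}
```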
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631621513 ## CI report: * 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499) * 55648de1a087b2081dde0ef11ef1eb6944e18e0c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631621309 ## CI report: * eb56e1be9ea831362a61adccec2ec2826c86d6a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18240) * 2c07b3e13de51845aad4e280c5fb07688f103d4a UNKNOWN
[GitHub] [hudi] jonvex opened a new pull request, #9173: Dynamic Partition Pruning Port
jonvex opened a new pull request, #9173: URL: https://github.com/apache/hudi/pull/9173 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631615497 ## CI report: * 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499)
[GitHub] [hudi] hudi-bot commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
hudi-bot commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-1631615225 ## CI report: * 93088bc330d7c96229f24e40910e067e4e4caf36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18497)
[GitHub] [hudi] prashantwason commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631610969

@danny0405 @nsivabalan I think the cleaning strategy change for MDT is a bugfix because of the following enhancements:
1. The initial commit on the MDT will create HFiles.
2. Rollbacks now actually roll back the MDT instead of adding a -f1, -f2 deltacommit.

If we use KEEP_LATEST_COMMITS, then a wrong setting here would probably keep only a single HFile, and that will limit the rollback. We cannot roll back the MDT beyond the last HFile, as we would lose data.
[GitHub] [hudi] bkosuru opened a new issue, #9172: [SUPPORT] java.lang.ClassCastException with incremental query
bkosuru opened a new issue, #9172: URL: https://github.com/apache/hudi/issues/9172

Hudi version: hudi-spark3.3-bundle_2.12-0.13.1
Table type: COW
Env: GCP Dataproc batches V 1.1

I am getting the following exception with an incremental query:

23/07/11 22:27:50 ERROR TaskSetManager: Task 552 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 552 in stage 1.0 failed 4 times, most recent failure: Lost task 552.3 in stage 1.0 (TID 771) (10.12.0.121 executor 34): java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:514)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1(HashAggregateExec.scala:98)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1$adapted(HashAggregateExec.scala:95)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2716)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2652)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2651)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2651)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2904)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2846)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2835)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.
[GitHub] [hudi] sydneyhoran closed pull request #9171: Thescore changes
sydneyhoran closed pull request #9171: Thescore changes URL: https://github.com/apache/hudi/pull/9171
[GitHub] [hudi] sydneyhoran opened a new pull request, #9171: Thescore changes
sydneyhoran opened a new pull request, #9171: URL: https://github.com/apache/hudi/pull/9171 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] prashantwason commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1260360731

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

@@ -116,11 +134,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
             // Below config is only used if isLogCompactionEnabled is set.
             .withLogCompactionBlocksThreshold(writeConfig.getMetadataLogCompactBlocksThreshold())
             .build())
-        .withParallelism(parallelism, parallelism)
-        .withDeleteParallelism(parallelism)
-        .withRollbackParallelism(parallelism)
-        .withFinalizeWriteParallelism(parallelism)
-        .withAllowMultiWriteOnSameInstant(true)
+        .withStorageConfig(HoodieStorageConfig.newBuilder().hfileMaxFileSize(maxHFileSizeBytes)
+            .logFileMaxSize(maxLogFileSizeBytes).logFileDataBlockMaxSize(maxLogBlockSizeBytes).build())
+        .withRollbackParallelism(defaultParallelism)
+        .withFinalizeWriteParallelism(defaultParallelism)

Review Comment: Yes. This was required because the previous code for commits would overwrite the same instant if it already exists. With the already committed rollback PR, if we get a commit on the MDT with the same timestamp as a previously applied deltacommit, then we will first roll back the previously applied deltacommit on the MDT and then commit the new change. Hence, multi-write-on-same-instant is never possible in the MDT.
[GitHub] [hudi] prashantwason commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1260359877

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -176,7 +176,7 @@ private void initMetadataReader() {
     }
     try {
-      this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath());
+      this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath(), true);

Review Comment: When initializing additional indexes (after the files partition is already initialized), the file listing is taken from the HoodieBackedTableMetadata itself. In this case, it is better to have reuse enabled so we don't keep listing each time an additional index is initialized. This is an optimization for the case when the FILES index exists and we are initializing one or more indexes.
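The reuse optimization described in the comment above can be sketched generically. This is a hypothetical illustration, not Hudi's HoodieBackedTableMetadata implementation: with reuse enabled, the expensive listing is computed once and served from a cache on later index initializations.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a "reuse" flag: cache the expensive file listing
// so that initializing several indexes lists the files only once.
public class ReusableListing {
  private final Supplier<List<String>> lister; // the expensive listing call
  private final boolean reuse;
  private List<String> cached;                 // lazily filled when reuse is on

  public ReusableListing(Supplier<List<String>> lister, boolean reuse) {
    this.lister = lister;
    this.reuse = reuse;
  }

  public List<String> listFiles() {
    if (!reuse) {
      return lister.get(); // reuse disabled: re-list on every call
    }
    if (cached == null) {
      cached = lister.get(); // first caller pays the listing cost
    }
    return cached;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    ReusableListing listing = new ReusableListing(() -> {
      calls[0]++;
      return List.of("f1.parquet");
    }, true);
    listing.listFiles();
    listing.listFiles();
    System.out.println("listings performed: " + calls[0]); // 1 with reuse enabled
  }
}
```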
[GitHub] [hudi] yihua commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
yihua commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631606184 @hudi-bot run azure