[GitHub] [hudi] imrewang commented on issue #9513: [SUPPORT]Index Bootstrap deleted snapshot data that has been batch-inserted into Hudi ?
imrewang commented on issue #9513: URL: https://github.com/apache/hudi/issues/9513#issuecomment-1691036778 Delete some of them, but not all of them. This is just to implement dynamic updates of the data. For example, the `incremental data` contains a deletion of a row of the `snapshot data`, and the corresponding row must now be deleted from the `snapshot data` that has already been stored in Hudi. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
codope commented on PR #9501: URL: https://github.com/apache/hudi/pull/9501#issuecomment-1691022209 @harsh1231 `ITTestHoodieSyncCommand#testValidateSync` failed in GH actions. Can you please check? I've retriggered the Spark 3.0 tests.
[jira] [Created] (HUDI-6743) Use current instant time as part of the log file name
Danny Chen created HUDI-6743: Summary: Use current instant time as part of the log file name Key: HUDI-6743 URL: https://issues.apache.org/jira/browse/HUDI-6743 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: Danny Chen Assignee: Danny Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6742) Remove the log file appending for multiple instants
[ https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen reassigned HUDI-6742: Assignee: Danny Chen > Remove the log file appending for multiple instants > --- > > Key: HUDI-6742 > URL: https://issues.apache.org/jira/browse/HUDI-6742 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0
[jira] [Created] (HUDI-6742) Remove the log file appending for multiple instants
Danny Chen created HUDI-6742: Summary: Remove the log file appending for multiple instants Key: HUDI-6742 URL: https://issues.apache.org/jira/browse/HUDI-6742 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: Danny Chen Fix For: 1.0.0
[jira] [Updated] (HUDI-6742) Remove the log file appending for multiple instants
[ https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6742: - Epic Link: HUDI-6640 Reviewers: Vinoth Chandar > Remove the log file appending for multiple instants > --- > > Key: HUDI-6742 > URL: https://issues.apache.org/jira/browse/HUDI-6742 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690992353 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690986681 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
[GitHub] [hudi] danny0405 commented on issue #9513: [SUPPORT]Index Bootstrap deleted snapshot data that has been batch-inserted into Hudi ?
danny0405 commented on issue #9513: URL: https://github.com/apache/hudi/issues/9513#issuecomment-1690982914 Why do you want to delete the snapshot data? And if you do, why did you import it?
[GitHub] [hudi] TengHuo commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
TengHuo commented on code in PR #9472: URL: https://github.com/apache/hudi/pull/9472#discussion_r1303746084 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.table.action.cluster; + +import org.apache.hudi.common.model.HoodieReplaceCommitMetadata; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.JsonUtils; +import org.apache.hudi.exception.HoodieException; + +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Set; +import java.util.stream.Stream; + +import static org.apache.hudi.common.util.FileIOUtils.LOG; + +public class ReplaceCommitValidateUtil { + public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds"; + public static void validateReplaceCommit(HoodieTableMetaClient metaClient) { +metaClient.reloadActiveTimeline(); +Set replaceFileids = new HashSet<>(); + +// Verify pending and completed replace commit + Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstantsAsStream(), + metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstantsAsStream()).map(instant -> { + try { +HoodieReplaceCommitMetadata replaceCommitMetadata = + HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(), +HoodieReplaceCommitMetadata.class); Review Comment: I think this issue is caused by different writers isolation from each other, and there is no lock in Hudi timeline, so when one writer is trying to `validateReplaceCommit`, it's impossible to guarantee that there is no other writer is modifying the same timeline. As @boneanxs mentioned, there could be other situation such as two `insert overwrite` running together, or `compaction` and `clustering` generating plan at the same time, the current timeline can't provide enough information (any partitions modified or dropped) to each writer to prevent these concurrent issue. 
In Hudi [RFC-56](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md), OCC was introduced to solve write conflicts at the file group level, but there is no conflict check at the partition level. So I think this partition-level conflict should be solved in the OCC framework; that is better because all similar situations are then solved as long as OCC is enabled. cc @voonhous
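To make the difference between a file-group-level check and a partition-level check concrete, here is a minimal Java sketch. It is not Hudi's conflict resolution code: `PartitionConflictSketch`, `Op`, and the `<partition>/<fileId>` string encoding are hypothetical names invented purely for illustration.

```java
import java.util.Collections;
import java.util.Set;

/** Toy model only; none of these types exist in Hudi. */
public class PartitionConflictSketch {

    /** Hypothetical summary of what one writer intends to commit. */
    public static final class Op {
        final Set<String> droppedPartitions;  // e.g. "2023/08/24"
        final Set<String> touchedFileGroups;  // assumed encoding: "<partition>/<fileId>"

        public Op(Set<String> droppedPartitions, Set<String> touchedFileGroups) {
            this.droppedPartitions = droppedPartitions;
            this.touchedFileGroups = touchedFileGroups;
        }
    }

    /** File-group-level check: conflict only if both writers touch the same file group. */
    public static boolean fileGroupConflict(Op a, Op b) {
        return !Collections.disjoint(a.touchedFileGroups, b.touchedFileGroups);
    }

    /** Partition-level check: conflict if one writer drops a partition the other writes into. */
    public static boolean partitionConflict(Op a, Op b) {
        return dropsPartitionOf(a, b) || dropsPartitionOf(b, a);
    }

    private static boolean dropsPartitionOf(Op dropper, Op other) {
        for (String fileGroup : other.touchedFileGroups) {
            // Everything before the last '/' is treated as the partition path.
            String partition = fileGroup.substring(0, fileGroup.lastIndexOf('/'));
            if (dropper.droppedPartitions.contains(partition)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Op clustering = new Op(Collections.<String>emptySet(),
                Collections.singleton("2023/08/24/fg-1"));
        Op deletePartition = new Op(Collections.singleton("2023/08/24"),
                Collections.<String>emptySet());
        System.out.println(fileGroupConflict(clustering, deletePartition)); // false
        System.out.println(partitionConflict(clustering, deletePartition)); // true
    }
}
```

In this toy model the clustering job and the delete-partition job share no file group, so a file-group-only check passes, while the partition-level check flags them.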
[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
majian1998 commented on code in PR #9472: URL: https://github.com/apache/hudi/pull/9472#discussion_r1303744014 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table.action.cluster; + +import org.apache.hudi.common.model.HoodieReplaceCommitMetadata; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.JsonUtils; +import org.apache.hudi.exception.HoodieException; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Set; +import java.util.stream.Stream; + +import static org.apache.hudi.common.util.FileIOUtils.LOG; + +public class ReplaceCommitValidateUtil { + public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds"; Review Comment: I don't think this situation will happen: Because my validation was done after the inflight commit was written to disk. Suppose two inflight replace commits happen at the same time: 1. 
Simultaneous disk write: Even if the disk writes happen at the same time, validation reloads the timeline before it begins, so each party will definitely see the file that the other party has written to disk. 2. Simultaneous validation: Both parties have already written to disk at this point, so both validations will find each other and report errors. 3. One party validates while the other writes to disk: The validating party has already completed its disk write, so the party that is still writing will definitely find it when it validates in turn.
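The three cases above can be sketched as a toy "publish inflight, then reload and validate" protocol. This is an illustration in plain Java, not the code from the PR: `ReplaceValidateSketch`, `publishInflight`, and `validate` are hypothetical names, and an in-memory map stands in for the timeline. The argument relies on a published inflight commit being visible to every later reload, which the map guarantees here and which is an assumption for real storage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of "publish inflight, then reload and validate"; not Hudi code. */
public class ReplaceValidateSketch {
    /** Hypothetical in-memory timeline: instant time -> file IDs that instant replaces. */
    private final Map<String, Set<String>> timeline = new ConcurrentHashMap<>();

    /** Step 1: make the inflight replace commit visible to every writer. */
    public void publishInflight(String instant, Set<String> fileIds) {
        timeline.put(instant, new HashSet<>(fileIds));
    }

    /** Step 2: re-read the timeline and list other instants that replace any of our file IDs. */
    public List<String> validate(String instant) {
        Set<String> mine = timeline.get(instant); // assumes publishInflight(instant, ...) ran first
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : timeline.entrySet()) {
            if (!e.getKey().equals(instant) && !Collections.disjoint(mine, e.getValue())) {
                conflicts.add(e.getKey());
            }
        }
        Collections.sort(conflicts);
        return conflicts;
    }

    public static void main(String[] args) {
        ReplaceValidateSketch t = new ReplaceValidateSketch();
        t.publishInflight("001", new HashSet<>(Arrays.asList("f1", "f2")));
        System.out.println(t.validate("001")); // [] : nothing else is published yet
        t.publishInflight("002", new HashSet<>(Arrays.asList("f2", "f3")));
        System.out.println(t.validate("002")); // [001] : the later writer sees the earlier one
    }
}
```

The `main` method walks through case 3: the first writer validates cleanly, and the second writer, which publishes later, finds the overlap on `f2` when it validates.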
[GitHub] [hudi] LinMingQiang commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
LinMingQiang commented on PR #7469: URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690920181 I get what you mean, but using the heartbeat mechanism might be too heavy. Maybe we can add a ticket and ask the writer who obtains the ticket to perform the rollback. It's like a lock, but it doesn't affect other writers.
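As a rough illustration of the ticket idea, the sketch below lets exactly one writer claim the right to roll back while the others carry on without blocking. It is an in-process toy, not a Hudi API: `RollbackTicketSketch` and its methods are hypothetical, and a real multi-cluster version would need an atomic create on shared storage or a lock service instead of an `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy in-process model of the "ticket" idea; not a Hudi API. */
public class RollbackTicketSketch {
    /** Holds the ID of the writer that owns the ticket, or null if it is free. */
    private final AtomicReference<String> holder = new AtomicReference<>();

    /** Non-blocking: returns true only for the single writer that wins the ticket. */
    public boolean tryAcquire(String writerId) {
        return holder.compareAndSet(null, writerId);
    }

    /** Frees the ticket, but only if the caller is the current owner. */
    public void release(String writerId) {
        holder.compareAndSet(writerId, null);
    }

    public String currentHolder() {
        return holder.get();
    }

    public static void main(String[] args) {
        RollbackTicketSketch ticket = new RollbackTicketSketch();
        System.out.println(ticket.tryAcquire("writer-1")); // true: performs the rollback
        System.out.println(ticket.tryAcquire("writer-2")); // false: skips rollback, keeps writing
        ticket.release("writer-1");
        System.out.println(ticket.tryAcquire("writer-2")); // true: ticket is free again
    }
}
```

Unlike a blocking lock, a writer that loses `tryAcquire` returns immediately, which matches the "doesn't affect other writers" part of the suggestion.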
[GitHub] [hudi] LinMingQiang opened a new pull request, #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
LinMingQiang opened a new pull request, #7469: URL: https://github.com/apache/hudi/pull/7469 ### Change Logs `getInstantsToRollback` should be locked too. ### Impact OCC mode. ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed
[GitHub] [hudi] LzMingYueShanPao commented on issue #143: Tracking ticket for folks to be added to slack group
LzMingYueShanPao commented on issue #143: URL: https://github.com/apache/hudi/issues/143#issuecomment-1690915753 Can you add me to the slack? dingding1035807...@dingtalk.com
[GitHub] [hudi] LinMingQiang commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
LinMingQiang commented on PR #7469: URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690915756 > If I am not wrong, this is the core problem we are trying to solve. If there are failed commits and two concurrent writers try to roll back concurrently, we don't have a lock as such. > > These complications arise just because Hudi tries to automatically clean up failed writes. In other similar systems, you may have to trigger explicit commands to clean up partially failed commits, or coordinate when multiple writers are involved. > > Wanted to call it out. Anyway, coming back to the original issue: it's recommended to disable table services (like cleaner, archival) in all writers except one, so we won't end up in such conflicts. These are not latency sensitive anyway. And with this approach all other writes will be even faster, since they don't trigger any of these table services and only take care of ingestion. > > We do have a table-level config to disable all table services: https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled > > Having said all this, here is how I feel we could fix this issue. > > We can leverage the heartbeats, such that rollback commits also start to emit heartbeats. A concurrent writer then knows whether some other writer is concurrently executing the rollback or whether it's in a failed state. That way, only one writer will go ahead and execute the rollback while the others step away. > > I remember @suryaprasanna wanted to fix this, if I am not wrong.
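The heartbeat proposal in the quoted comment comes down to one decision per pending rollback: is its owner still alive? Below is a minimal sketch of that decision in plain Java. The class, the enum, and the timeout parameter are hypothetical and do not reflect Hudi's heartbeat implementation, which is not shown in this thread.

```java
/** Toy decision rule for the heartbeat proposal; names are hypothetical, not Hudi's. */
public class RollbackHeartbeatSketch {

    public enum Action { STEP_AWAY, TAKE_OVER }

    /**
     * @param lastHeartbeatMs last heartbeat seen for the pending rollback, or null if none exists
     * @param nowMs           current time in milliseconds
     * @param timeoutMs       heartbeat age after which the owning writer is presumed failed
     */
    public static Action decide(Long lastHeartbeatMs, long nowMs, long timeoutMs) {
        if (lastHeartbeatMs != null && nowMs - lastHeartbeatMs <= timeoutMs) {
            return Action.STEP_AWAY; // another writer is actively executing the rollback
        }
        return Action.TAKE_OVER;     // no heartbeat, or it expired: the rollback is orphaned
    }

    public static void main(String[] args) {
        System.out.println(decide(9_000L, 10_000L, 5_000L)); // STEP_AWAY (heartbeat is 1 s old)
        System.out.println(decide(1_000L, 10_000L, 5_000L)); // TAKE_OVER (heartbeat is 9 s old)
        System.out.println(decide(null, 10_000L, 5_000L));   // TAKE_OVER (never heartbeated)
    }
}
```

This rule alone does not stop two writers from taking over the same expired rollback at the same moment, so some tie-break is still needed in practice.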
[GitHub] [hudi] LinMingQiang closed pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
LinMingQiang closed pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC URL: https://github.com/apache/hudi/pull/7469
[GitHub] [hudi] boneanxs commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
boneanxs commented on code in PR #9472: URL: https://github.com/apache/hudi/pull/9472#discussion_r1303729529 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table.action.cluster; + +import org.apache.hudi.common.model.HoodieReplaceCommitMetadata; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.JsonUtils; +import org.apache.hudi.exception.HoodieException; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Set; +import java.util.stream.Stream; + +import static org.apache.hudi.common.util.FileIOUtils.LOG; + +public class ReplaceCommitValidateUtil { + public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds"; Review Comment: How you validate it if such issue happens? Like 2 inflight commits written nearly at the same time, and each other doesn't see it? -- This is an automated message from the Apache Git Service. 
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
nsivabalan commented on code in PR #9422: URL: https://github.com/apache/hudi/pull/9422#discussion_r1303719719 ## hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java: ## @@ -0,0 +1,481 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.functional; + +import org.apache.hudi.DataSourceReadOptions; +import org.apache.hudi.DataSourceWriteOptions; +import org.apache.hudi.client.SparkRDDWriteClient; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.testutils.HoodieTestDataGenerator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieCompactionConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.testutils.HoodieSparkClientTestBase; + +import org.apache.spark.SparkException; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Properties; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.testutils.RawTripTestPayload.recordToString; +import static org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS; +import static org.apache.spark.sql.SaveMode.Append; +import static org.apache.spark.sql.SaveMode.Overwrite; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertThrows; + +/** + * Test mor with colstats enabled in scenarios to ensure that files + * are being appropriately read or not read. + * The strategy employed is to corrupt targeted base files. 
If we want + * to prove the file is read, we assert that an exception will be thrown. + * If we want to prove the file is not read, we expect the read to + * successfully execute. + */ +public class TestMORColstats extends HoodieSparkClientTestBase { + + private static String matchCond = "trip_type = 'UBERX'"; + private static String nonMatchCond = "trip_type = 'BLACK'"; + private static String[] dropColumns = {"_hoodie_commit_time", "_hoodie_commit_seqno", + "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"}; + + private Boolean shouldOverwrite; + Map options; + @TempDir + public java.nio.file.Path basePath; + + @BeforeEach + public void setUp() throws Exception { +initSparkContexts(); +dataGen = new HoodieTestDataGenerator(); +shouldOverwrite = true; +options = getOptions(); +Properties props = new Properties(); +props.putAll(options); +try { + metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath.toString(), props); +} catch (IOException e) { + throw new RuntimeException(e); +} + } + + @AfterEach + public void tearDown() throws IOException { +cleanupSparkContexts(); +cleanupTestDataGenerator(); +metaClient = null; + } + + /** + * Create two files, one should be excluded by colstats + */ + @Test + public void testBaseFileOnly() { +Dataset inserts = makeInsertDf("000", 100); +Dataset batch1 = inserts.where(matchCond); +Dataset batch2 = inserts.where(nonMatchCond); +doWrite(batch1); +doWrite(batch2); +List filesToCorrupt = getFilesToCorrupt(); +assertEquals(1, filesToCorrupt.size()); +filesToCorrupt.forEach(TestMORColstats::corruptFile); +assertEquals(0, readMatchingRecords().except(batch1).count()); +//Read without data skipping to show that it will fail +//Reading with data skipping succeeded so that means that data skipping is working and the corrupted +//file
[GitHub] [hudi] nsivabalan merged pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
nsivabalan merged PR #9468: URL: https://github.com/apache/hudi/pull/9468
[hudi] branch master updated: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario (#9468)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new d7c16f56f4f [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario (#9468) d7c16f56f4f is described below commit d7c16f56f4f9dfa3a160dac459ae11944f922ec8 Author: Jon Vexler AuthorDate: Wed Aug 23 22:30:41 2023 -0400 [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario (#9468) - If two cleans start at nearly the same time, they will both attempt to execute the same clean instances. This does not cause any data corruption, but will cause a writer to fail when they attempt to create the commit in the timeline. This is because the commit will have already been written by the first writer. Now, we check the timeline before transitioning state. Co-authored-by: Jonathan Vexler <=> --- .../hudi/table/action/clean/CleanActionExecutor.java | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java index 05e1056324a..c931e7bce9d 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java @@ -261,8 +261,10 @@ public class CleanActionExecutor extends BaseActionExecutor extends BaseActionExecutor 0 ? 
cleanMetadataList.get(cleanMetadataList.size() - 1) : null; } + + private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, HoodieIOException e) { +table.getMetaClient().reloadActiveTimeline(); +if (table.getCleanTimeline().filterCompletedInstants().containsInstant(hoodieInstant.getTimestamp())) { + LOG.warn("Clean operation was completed by another writer for instant: " + hoodieInstant); +} else { + LOG.error("Failed to perform previous clean operation, instant: " + hoodieInstant, e); + throw e; +} + } }
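The idea behind `checkIfOtherWriterCommitted` is to re-check the timeline before treating a failed transition as an error. The self-contained Java sketch below models that idea with an in-memory set. It is not the Hudi implementation: `CleanRaceSketch`, `completedCleans`, and the use of `IllegalStateException` (with an empty instant time simulating an unrelated storage failure) are assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of "check the timeline before failing"; does not use Hudi's real classes. */
public class CleanRaceSketch {
    /** Hypothetical stand-in for the completed clean instants on the timeline. */
    static final Set<String> completedCleans = ConcurrentHashMap.newKeySet();

    /** Returns true if this writer completed the clean, false if another writer already had. */
    public static boolean completeClean(String instantTime) {
        try {
            transitionToCompleted(instantTime);
            return true;
        } catch (IllegalStateException e) {
            // Re-check shared state before deciding the exception is a real failure.
            if (completedCleans.contains(instantTime)) {
                return false; // benign: another writer finished the same clean first
            }
            throw e; // genuine failure: propagate
        }
    }

    static void transitionToCompleted(String instantTime) {
        if (instantTime.isEmpty()) {
            throw new IllegalStateException("Simulated storage failure");
        }
        if (!completedCleans.add(instantTime)) {
            throw new IllegalStateException("Completed clean already exists: " + instantTime);
        }
    }

    public static void main(String[] args) {
        System.out.println(completeClean("20230823223041")); // true: this writer wins
        System.out.println(completeClean("20230823223041")); // false: second attempt is tolerated
    }
}
```

The second call stands in for the losing writer: its transition fails, but because the completed instant is already present it returns normally instead of failing the job.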
[GitHub] [hudi] nsivabalan commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
nsivabalan commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690896492 https://github.com/apache/hudi/assets/513218/645da90d-5ae4-4d4f-8c0a-7b6cc49e0c0a
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690872044 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690817940 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690817806 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * 976734054ed50853f16196cd68f8a3c2a5976728 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19430)
[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails
Riddle4045 commented on issue #9505: URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690797389 closing
[GitHub] [hudi] Riddle4045 closed issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails
Riddle4045 closed issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails URL: https://github.com/apache/hudi/issues/9505
[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails
Riddle4045 commented on issue #9505: URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690797249 This worked with Spark 3.4; Spark 3.1 and 3.3 failed.
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690787290 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
[GitHub] [hudi] Jason-liujc opened a new issue, #9512: [SUPPORT] No table level lock when using DynamoDB lock provider
Jason-liujc opened a new issue, #9512: URL: https://github.com/apache/hudi/issues/9512

**Describe the problem you faced**

We have a use case where we need a table-level lock for multiple EMR clusters writing to the same Hudi table. We've tried the two options provided by the Hudi OCC guide page.

For the `single_writer` option, we don't see any lock being created in DynamoDB, so there are no write locks and the EMR Hudi upsert job fails due to multiple writers writing to the same table.

For `optimistic_concurrency_control`, it creates and deletes many locks in DynamoDB, and the lock is not at the table level. So when two jobs write to the same file, one of the jobs will fail. This leads to us having to add a lot of retries in our cluster, which is not ideal. Is there any

**To Reproduce**

Steps to reproduce the behavior:

1. Run Hudi upsert jobs in multiple AWS EMR clusters writing to the same table at the same time.
2. Use DynamoDB as the lock provider; these jobs write to the same partitions.
3. Use `single_writer` + `hoodie.cleaner.policy.failed.writes=EAGER` or `optimistic_concurrency_control` + `hoodie.cleaner.policy.failed.writes=LAZY` as part of the Hudi write options.
4. Each time, the jobs fail with different errors.

**Expected behavior**

Hudi would provide a table-level lock that allows us to block other writers from writing until one of the writers is done when writing to the same partition.

**Environment Description**

* Hudi version : 0.13.0 (EMR 6.11)
* Spark version : 3.3.1
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

We are considering building something of our own to enforce sequential insertion during orchestration. But this can be avoided if Hudi provides a table-level lock.
**Stacktrace**

Single writer:

```
23/08/21 23:19:47 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3://bucket/xxxkey commits 20230821231625673
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:823)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:727)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:711)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706)
	at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:843)
```

Multi writer with `optimistic_concurrency_control`:

```
23/08/21 21:40:31 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
	at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:108)
	at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
```
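For reference, the multi-writer configuration described in issue #9512 can be sketched as a set of Hudi write options. This is only a minimal sketch, not the reporter's actual job: the config keys are the ones I understand Hudi's concurrency-control documentation to list for the DynamoDB lock provider, and the lock table name, partition key, and region below are hypothetical placeholders.

```python
def occ_dynamodb_lock_options(lock_table, partition_key, region):
    """Build Hudi writer options for optimistic concurrency control
    backed by a DynamoDB lock provider (sketch; values are placeholders)."""
    return {
        # Multi-writer mode from step 3 of the issue.
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        # Failed writes are cleaned up lazily when several writers are active.
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        # Lock provider class shipped in the hudi-aws module.
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        # Hypothetical DynamoDB table, key, and region holding the lock.
        "hoodie.write.lock.dynamodb.table": lock_table,
        "hoodie.write.lock.dynamodb.partition_key": partition_key,
        "hoodie.write.lock.dynamodb.region": region,
    }


options = occ_dynamodb_lock_options("hudi_locks", "my_table", "us-east-1")
```

In a PySpark job these would typically be passed with `df.write.format("hudi").options(**options)`. As the second stack trace in the issue shows, overlapping writes to the same files can still be rejected under this mode, so some retry logic around the write may remain necessary.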
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690681949 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers
hudi-bot commented on PR #9510: URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690681912 ## CI report: * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19428)
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
the-other-tim-brown commented on code in PR #9511: URL: https://github.com/apache/hudi/pull/9511#discussion_r1303570089

## hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java: ##
@@ -220,16 +220,15 @@ public static FileSystemViewManager createViewManager(final HoodieEngineContext
       final HoodieMetadataConfig metadataConfig,
       final FileSystemViewStorageConfig config,
       final HoodieCommonConfig commonConfig) {
-    return createViewManager(context, metadataConfig, config, commonConfig, (SerializableSupplier) null);
+    return createViewManager(context, metadataConfig, config, commonConfig, null);
   }
-  public static FileSystemViewManager createViewManager(final HoodieEngineContext context,
-      final HoodieMetadataConfig metadataConfig,
-      final FileSystemViewStorageConfig config,
-      final HoodieCommonConfig commonConfig,
-      final String basePath) {
+  public static FileSystemViewManager createViewManagerWithTableMetadata(final HoodieEngineContext context,
+      final HoodieMetadataConfig metadataConfig,
+      final FileSystemViewStorageConfig config,
+      final HoodieCommonConfig commonConfig) {
     return createViewManager(context, metadataConfig, config, commonConfig,

Review Comment: I think we would need a way to get this config at runtime, but I don't see a way to do that with the current structure.
[GitHub] [hudi] nsivabalan commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
nsivabalan commented on PR #7469: URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690616808

If I am not wrong, this is the core problem we are trying to solve: if there are failed commits and two concurrent writers try to roll back concurrently, we don't have a lock as such. These complications arise just because Hudi tries to do automatic cleanup of failed writes. In other similar systems, you may have to trigger explicit commands to clean up partially failed commits, or coordinate when multiple writers are involved. Wanted to call it out.

Anyway, coming back to the original issue: it's recommended to disable table services (like cleaner, archival) in all writers except one, so we won't end up in such conflicts. These are not latency sensitive anyway. And with this approach all other writes will be even faster, since they don't trigger any of these table services and only take care of ingestion. We do have a table-level config to disable all table services: https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled

Having said all this, here is how I feel we could fix this issue. We can leverage the heartbeats, such that rollback commits also start to emit heartbeats. That way a concurrent writer knows whether some other writer is concurrently executing the rollback, or whether it's in a failed state. Only one writer will go ahead and execute the rollback while the others will step away. I remember @suryaprasanna wanted to fix this, if I am not wrong.
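The two ideas in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not Hudi's implementation: the config key `hoodie.table.services.enabled` is inferred from the documentation anchor linked in the comment, and `RollbackHeartbeats`, `try_claim`, the writer ids, and the instant time are hypothetical names. The registry is in-memory only; a real version would need an atomic claim on shared storage.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


def writer_options(runs_table_services: bool) -> Dict[str, str]:
    """Only the one designated writer keeps table services (cleaner, archival) on."""
    return {"hoodie.table.services.enabled": "true" if runs_table_services else "false"}


@dataclass
class RollbackHeartbeats:
    """Toy in-memory registry: instant time -> (writer id, last heartbeat time)."""
    timeout_s: float
    _beats: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def try_claim(self, instant: str, writer_id: str, now: float) -> bool:
        """Return True if this writer may execute (or continue) the rollback."""
        owner = self._beats.get(instant)
        if owner is not None:
            owner_id, last_beat = owner
            # Another writer is still heartbeating on this rollback: step away.
            if owner_id != writer_id and now - last_beat < self.timeout_s:
                return False
        # No owner, an expired owner, or our own claim: record a fresh heartbeat.
        self._beats[instant] = (writer_id, now)
        return True


heartbeats = RollbackHeartbeats(timeout_s=10.0)
```

A writer would call `try_claim` before rolling back a failed instant and keep calling it while the rollback runs; a second writer only takes over once the first one's heartbeat has expired.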
[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails
Riddle4045 commented on issue #9505: URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690607980 @danny0405 is there any doc on exactly which version of Spark works with `hudi-cli/hudi-cli.sh`? I have tried both 2.x and 3.x, and both have different dependency conflicts.
[GitHub] [hudi] yihua commented on a diff in pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
yihua commented on code in PR #9511: URL: https://github.com/apache/hudi/pull/9511#discussion_r1303505252

## hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java: ##
@@ -220,16 +220,15 @@ public static FileSystemViewManager createViewManager(final HoodieEngineContext
       final HoodieMetadataConfig metadataConfig,
       final FileSystemViewStorageConfig config,
       final HoodieCommonConfig commonConfig) {
-    return createViewManager(context, metadataConfig, config, commonConfig, (SerializableSupplier) null);
+    return createViewManager(context, metadataConfig, config, commonConfig, null);
   }
-  public static FileSystemViewManager createViewManager(final HoodieEngineContext context,
-      final HoodieMetadataConfig metadataConfig,
-      final FileSystemViewStorageConfig config,
-      final HoodieCommonConfig commonConfig,
-      final String basePath) {
+  public static FileSystemViewManager createViewManagerWithTableMetadata(final HoodieEngineContext context,
+      final HoodieMetadataConfig metadataConfig,
+      final FileSystemViewStorageConfig config,
+      final HoodieCommonConfig commonConfig) {
     return createViewManager(context, metadataConfig, config, commonConfig,

Review Comment: Not sure if we need to differentiate the `metadataConfig` across tables as well. For now, it's fine.
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690550106 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425) * 976734054ed50853f16196cd68f8a3c2a5976728 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19430)
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690540693 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425) * 976734054ed50853f16196cd68f8a3c2a5976728 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690448816 ## CI report: * 39166302aadd51524e017f92a883e960e07a37a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19426)
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690448684 ## CI report: * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19427)
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690448536 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
[GitHub] [hudi] CTTY commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted
CTTY commented on PR #9444: URL: https://github.com/apache/hudi/pull/9444#issuecomment-1690447794 Hi @nsivabalan, I checked the code and it looks like expected behavior. For the non-Java 17 test, it also fails and repeats 204 times. If we want it to repeat less, then maybe we can look into reducing the `retryLoop` number at this [line](https://github.com/apache/hudi/blob/87a187a91d472f16d569b6388098d27a3aede760/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtilsWithRetryWrapperEnable.java#L74)
[GitHub] [hudi] twlo-sandeep commented on pull request #9475: [MINOR] fixing mysql debezium data loss
twlo-sandeep commented on PR #9475: URL: https://github.com/apache/hudi/pull/9475#issuecomment-1690398458 @rmahindra123 - can you please take a look at this PR?
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690374678 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata
hudi-bot commented on PR #9509: URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690374607 ## CI report: * 96700cfb604e637219f231d28a84dfcf033112b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19424)
[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
hudi-bot commented on PR #9511: URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690312328 ## CI report: * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
[jira] [Updated] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6741:
    Labels: pull-request-available (was: )

> Timeline server cannot handle multiple base paths when metadata table is enabled
>
> Key: HUDI-6741
> URL: https://issues.apache.org/jira/browse/HUDI-6741
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Timothy Brown
> Assignee: Timothy Brown
> Priority: Major
> Labels: pull-request-available
>
> The Timeline Server will take in a view manager to gather the information about the tables. When the metadata table is enabled, there is a supplier that will be called to get the HoodieTableMetadata. That supplier is configured for a single base path but the timeline server can be used for multiple tables.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] the-other-tim-brown opened a new pull request, #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled
the-other-tim-brown opened a new pull request, #9511: URL: https://github.com/apache/hudi/pull/9511

### Change Logs

Updates the logic for creating the `HoodieTableMetadata` in the `FileSystemViewManager` to be a function of the `HoodieTableMetaClient` instead of simply a supplier. This allows us to use the current basePath instead of the base path that is available when creating the `FileSystemViewManager`, which enables a single view manager to work properly with multiple tables when metadata table is enabled for those tables.

### Impact

Fixes bug impacting use of a single timeline server for multiple tables with metadata table enabled for those tables.

### Risk level (write none, low medium or high below)

low, small risk of changing the signature of a public method for those that may be calling it outside of the Hudi project. Otherwise, all changes are hidden from the user.

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
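The change described in the PR above, replacing a zero-argument supplier bound to one base path with a function evaluated per table, can be illustrated with a small sketch. The names below (`TableMetadata`, `ViewManager`, `metadata_for`) are hypothetical stand-ins rather than Hudi classes, and the sketch keys on a plain base-path string where the PR passes a `HoodieTableMetaClient`.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TableMetadata:
    """Stand-in for per-table metadata; only remembers which table it serves."""
    base_path: str


class ViewManager:
    """Resolves metadata per table instead of capturing one base path up front."""

    def __init__(self, metadata_creator: Callable[[str], TableMetadata]):
        # A function of the table, not a supplier fixed at construction time.
        self._metadata_creator = metadata_creator
        self._cache: Dict[str, TableMetadata] = {}

    def metadata_for(self, base_path: str) -> TableMetadata:
        # Create the metadata lazily, once per registered table.
        if base_path not in self._cache:
            self._cache[base_path] = self._metadata_creator(base_path)
        return self._cache[base_path]


manager = ViewManager(lambda base_path: TableMetadata(base_path))
```

With a supplier such as `lambda: TableMetadata("/tables/a")`, every table registered with the same manager would receive metadata for `/tables/a`, which is the failure mode HUDI-6741 describes.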
[jira] [Assigned] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Brown reassigned HUDI-6741:
    Assignee: Timothy Brown

> Timeline server cannot handle multiple base paths when metadata table is enabled
>
> Key: HUDI-6741
> URL: https://issues.apache.org/jira/browse/HUDI-6741
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Timothy Brown
> Assignee: Timothy Brown
> Priority: Major
>
> The Timeline Server will take in a view manager to gather the information about the tables. When the metadata table is enabled, there is a supplier that will be called to get the HoodieTableMetadata. That supplier is configured for a single base path but the timeline server can be used for multiple tables.
[jira] [Created] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled
Timothy Brown created HUDI-6741:

    Summary: Timeline server cannot handle multiple base paths when metadata table is enabled
    Key: HUDI-6741
    URL: https://issues.apache.org/jira/browse/HUDI-6741
    Project: Apache Hudi
    Issue Type: Bug
    Reporter: Timothy Brown

The Timeline Server will take in a view manager to gather the information about the tables. When the metadata table is enabled, there is a supplier that will be called to get the HoodieTableMetadata. That supplier is configured for a single base path but the timeline server can be used for multiple tables.
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690218261 ## CI report: * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360) * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19427)
[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers
hudi-bot commented on PR #9510: URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690206192 ## CI report: * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19428)
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690205649 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * 37e7683cbab44901aaeab3c08df772b4eda74590 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19413) * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690205920 ## CI report: * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360) * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690189441 ## CI report: * 6eca81567a20e35f4f89ac2267fb7d538e09feec Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19381) * 39166302aadd51524e017f92a883e960e07a37a4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19426)
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690189258 ## CI report: * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360) * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN
[GitHub] [hudi] ad1happy2go commented on issue #9397: [SUPPORT] column_stats index filtering returns empty results
ad1happy2go commented on issue #9397: URL: https://github.com/apache/hudi/issues/9397#issuecomment-169010

@nandubatchu I was not able to reproduce this issue with a sample dataset. Can you share your table configs? Below is the code I was trying to use. I confirmed it's using col stats for one of the columns. Please provide some more insights about your data to help us triage it better.

```
create table issue_col_stats_issue_9397_1 (
  id string,
  name string,
  col1 int,
  col2 int,
  col3 int,
  updated_at string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'updated_at',
  hoodie.metadata.index.column.stats.enable = 'true',
  hoodie.metadata.index.column.stats.column.list = 'col1'
)
location 'file:///tmp/issue_col_stats_issue_9397_1';

INSERT INTO issue_col_stats_issue_9397_1 VALUES
  ('1', 'c',1,1,1,'2023-01-01 00:00:00'),
  ('2', 'c',2,2,2,'2023-01-01 00:00:00'),
  ('3', 'c',3,3,3,'2023-01-01 00:00:00');

-- Verify if col stats present at /tmp/issue_col_stats_issue_9397_1/.hoodie/metadata/

select * from issue_col_stats_issue_9397_1 where col1 = 1 and col2 = 1 and col3 = 1;
-- Correct Results

select * from issue_col_stats_issue_9397_1 where col1 > 1 and col2 > 1 and col3 > 1;
-- Correct Results
```
[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers
hudi-bot commented on PR #9510: URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690189633 ## CI report: * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690189040 ## CI report: * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN * 37e7683cbab44901aaeab3c08df772b4eda74590 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19413) * baf417339e41d1eb4bad6b351162e0457a9965ad UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
nsivabalan commented on code in PR #9468: URL: https://github.com/apache/hudi/pull/9468#discussion_r1303181318

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java: ##
@@ -278,4 +280,14 @@ public HoodieCleanMetadata execute() {
     // This requires the CleanActionExecutor to be refactored as BaseCommitActionExecutor
     return cleanMetadataList.size() > 0 ? cleanMetadataList.get(cleanMetadataList.size() - 1) : null;
   }
+
+  private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, HoodieIOException e) {
+    table.getMetaClient().reloadActiveTimeline();
+    if (table.getCleanTimeline().filterCompletedInstants().containsInstant(hoodieInstant.getTimestamp())) {
+      LOG.warn("Clean operation was completed by another writer for instant: " + hoodieInstant);
+    } else {
+      LOG.warn("Failed to perform previous clean operation, instant: " + hoodieInstant, e);

Review Comment:
LOG.error()
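The check in this diff can be modelled outside Hudi to make the two branches easier to follow. The snippet below is a simplified Python sketch, not Hudi code: `Timeline`, `reload`, and `completed_clean_instants` are hypothetical stand-ins for the meta client and the clean timeline, the function returns a boolean only so it can be exercised (the Java method is `void`), and the failure branch logs at error level as the review comment suggests.

```python
import logging

LOG = logging.getLogger("clean_action_sketch")


class Timeline:
    """Hypothetical stand-in for the table's clean timeline."""

    def __init__(self, load_completed):
        # load_completed: callable returning the completed clean instant times.
        self._load_completed = load_completed
        self.completed_clean_instants = set()

    def reload(self):
        # Re-read state so that commits made by other writers become visible.
        self.completed_clean_instants = set(self._load_completed())


def check_if_other_writer_committed(timeline, instant_time, error):
    """Return True if another writer already completed this clean instant."""
    timeline.reload()
    if instant_time in timeline.completed_clean_instants:
        LOG.warning("Clean operation was completed by another writer for instant: %s",
                    instant_time)
        return True
    # Nobody completed the instant, so this is a genuine failure.
    LOG.error("Failed to perform previous clean operation, instant: %s",
              instant_time, exc_info=error)
    return False
```

Reloading before the lookup matters in this model: without it, the writer would only see the timeline as it was when the clean started.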
[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata
hudi-bot commented on PR #9509: URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690114750 ## CI report: * 96700cfb604e637219f231d28a84dfcf033112b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19424)
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690114615 ## CI report: * 6eca81567a20e35f4f89ac2267fb7d538e09feec Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19381) * 39166302aadd51524e017f92a883e960e07a37a4 UNKNOWN
[jira] [Updated] (HUDI-6739) Avoid checking timeline for successful commits for spark structured streaming when offset is 0
[ https://issues.apache.org/jira/browse/HUDI-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6739: - Labels: pull-request-available (was: ) > Avoid checking timeline for successful commits for spark structured streaming > when offset is 0 > -- > > Key: HUDI-6739 > URL: https://issues.apache.org/jira/browse/HUDI-6739 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Aditya Goenka >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > For Spark Structured Streaming writer, for new writer we are checking the > active timeline to identify the last successful commit and fix idempotency. > No need to check the checkpoint when the offset is 0 as that will mean it > will be the new writer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] ad1happy2go opened a new pull request, #9510: [HUDI-6739] Skip checking Active Timeline to improve performance
ad1happy2go opened a new pull request, #9510: URL: https://github.com/apache/hudi/pull/9510

### Change Logs

Improved the checkpoint management logic for multiple streaming writers so that it does not check the active timeline when the offset is 0, since that means it is a new writer.

### Impact

none

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
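The short-circuit described in this pull request can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the PR: `can_skip_batch` and `read_last_committed_batch_id` are hypothetical names, and the `batch_id <= last_committed` comparison is an assumed idempotency rule. The point it shows is that the (potentially expensive) timeline read never happens at offset 0.

```python
def can_skip_batch(batch_id, read_last_committed_batch_id):
    """Decide whether a streaming micro-batch was already committed.

    batch_id: offset of the incoming micro-batch.
    read_last_committed_batch_id: hypothetical callable that scans the active
        timeline and returns the last committed batch id, or None.
    """
    if batch_id == 0:
        # Offset 0 means a new writer: it cannot have committed anything yet,
        # so the timeline scan is skipped entirely.
        return False
    last_committed = read_last_committed_batch_id()
    return last_committed is not None and batch_id <= last_committed
```

Passing the timeline reader as a callable keeps the sketch testable and makes it easy to confirm that the reader is not invoked for the first batch.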
[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata
hudi-bot commented on PR #9509: URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690099343 ## CI report: * 96700cfb604e637219f231d28a84dfcf033112b9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1690084274 ## CI report: * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19422)
[GitHub] [hudi] Zouxxyy opened a new pull request, #9509: [MINOR] Add operationType in alter table commit metadata
Zouxxyy opened a new pull request, #9509: URL: https://github.com/apache/hudi/pull/9509

### Change Logs

Add `operationType` in alter table commit metadata.

Before, xxx.commit:

```json
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "xxx",
    "latest_schema" : "xxx"
  },
  "operationType" : null
}
```

After, xxx.commit:

```json
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "xxx",
    "latest_schema" : "xxx"
  },
  "operationType" : "ALTER_SCHEMA"
}
```

### Impact

See above.

### Risk level (write none, low medium or high below)

none

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
hudi-bot commented on PR #9501: URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689979230 ## CI report: * 2db1e3e6179a5b45172ea9abae0c869375fffc1b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19421)
[hudi] branch master updated: [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 87a187a91d4 [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)
87a187a91d4 is described below

commit 87a187a91d472f16d569b6388098d27a3aede760
Author: Lokesh Jain
AuthorDate: Wed Aug 23 18:39:08 2023 +0530

    [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)

    - Since the log block version (due to delete block change) has been upgraded in 0.14.0, the delete blocks can not be read in 0.13.0 or earlier.
    - Similarly the addition of record level index field in metadata table leads to column drop error on downgrade.

    The Jira aims to fix the downgrade handler to trigger compaction and delete metadata table if user wishes to downgrade from version six (0.14.0) to version 5 (0.13.0).
---
 .../table/upgrade/SixToFiveDowngradeHandler.java   |  53 ++--
 .../table/upgrade/SupportsUpgradeDowngrade.java    |   3 +
 .../table/upgrade/FlinkUpgradeDowngradeHelper.java |   7 +
 .../table/upgrade/JavaUpgradeDowngradeHelper.java  |   7 +
 .../table/upgrade/SparkUpgradeDowngradeHelper.java |   7 +
 .../hudi/table/upgrade/TestUpgradeDowngrade.java   |  10 +-
 .../functional/TestSixToFiveDowngradeHandler.scala | 142 +
 7 files changed, 211 insertions(+), 18 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
index 228c0f710a8..4793f368f81 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
@@ -18,19 +18,26 @@
 
 package org.apache.hudi.table.upgrade;
 
+import org.apache.hudi.client.BaseHoodieWriteClient;
 import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.common.table.HoodieTableVersion;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
 import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
+import org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy;
 
 import org.apache.hadoop.fs.Path;
 
@@ -39,12 +46,15 @@ import java.util.Map;
 
 import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS;
 import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT;
-import static org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTablePartition;
 
 /**
  * Downgrade handle to assist in downgrading hoodie table from version 6 to 5.
  * To ensure compatibility, we need recreate the compaction requested file to
  * .aux folder.
+ * Since version 6 includes a new schema field for metadata table(MDT),
+ * the MDT needs to be deleted during downgrade to avoid column drop error.
+ * Also log block version was upgraded in version 6, therefore full compaction needs
+ * to be completed during downgrade to avoid both read and future compaction failures.
  */
 public class SixToFiveDowngradeHandler implements DowngradeHandler {
 
@@ -52,11 +62,16 @@ public class SixToFiveDowngradeHandler implements DowngradeHandler {
   public Map downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) {
     final HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
 
-    removeRecordIndexIfNeeded(table, context);
+    // Since version 6 includes a new schema field for metadata table(MDT), the MDT needs to be deleted during downgrade to avoid column drop error.
+    HoodieTableMetadataUtil.deleteMetadataTable(config.getBasePath(), context);
+    // The log block version has been upgraded in version six so compaction is required for downgrade.
+    runCompaction(table,
[GitHub] [hudi] nsivabalan merged pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0
nsivabalan merged PR #9467: URL: https://github.com/apache/hudi/pull/9467
[GitHub] [hudi] Hans-Raintree opened a new issue, #9508: [SUPPORT] Reading CDC logs can't extract data type
Hans-Raintree opened a new issue, #9508: URL: https://github.com/apache/hudi/issues/9508

Describe the problem you faced

When reading CDC logs in PySpark, the before/after columns are returned in JSON format. It is difficult to convert them back to the correct data types. For example, a timestamp is returned as a number, and to convert it back to a timestamp you have to do something like this: `df = df.withColumn(col_name, to_timestamp(from_unixtime(df[col_name]/100)))`

A decimal value is stored as something like `[0, 0, 0, 0, 0, 0, 26, -27, -78]`, which is even more difficult to parse back to a number. I see that the data types are stored in the .cdc file, but I don't see how I can access them.

To Reproduce

Steps to reproduce the behavior:

Write a table with 'hoodie.table.cdc.enabled': 'true', 'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after'

Read the CDC logs with: 'hoodie.datasource.query.type': 'incremental', 'hoodie.datasource.read.begin.instanttime': begin_time, 'hoodie.datasource.query.incremental.format': 'cdc'

Expected behavior

A way to extract columns from the before/after columns with their data types.

Environment Description

* Hudi version : 0.13.1
* Spark version : 3.3.2 / 3.4.0
* Hive version : 3.1.3
* Hadoop version : 2.7 / 3.3.3
* Storage (HDFS/S3/GCS..) : S3 / Local
* Running on Docker? (yes/no) : no

Additional context

Happens both in AWS EMR 6.12.0 and when running locally.
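For the decimal case in this issue, the byte array can be decoded by hand. The following is a minimal sketch, assuming the array is the Avro `decimal` logical type rendered as signed bytes (a big-endian two's-complement unscaled integer) and that the column's scale is known from the table schema. The helper name `decode_avro_decimal` and the scale of 2 are illustrative assumptions, not part of Hudi's API.

```python
from decimal import Decimal


def decode_avro_decimal(signed_bytes, scale):
    """Convert a list of signed byte values into a Decimal.

    Assumes Avro's decimal encoding: the bytes hold the unscaled integer
    in big-endian two's-complement form.
    """
    # The JSON rendering shows Java-style signed bytes (-128..127);
    # mask each one into the 0..255 range that Python's bytes expects.
    raw = bytes(b & 0xFF for b in signed_bytes)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)


# Scale 2 is an assumption here; use the scale declared for the column.
value = decode_avro_decimal([0, 0, 0, 0, 0, 0, 26, -27, -78], scale=2)
print(value)  # 17627.38 under the assumed scale of 2
```

The same function could be wrapped in a PySpark UDF once the before/after JSON has been parsed, but that wiring depends on the table schema and is not shown here.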
[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 support matrix doc
[ https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6740: - Labels: pull-request-available (was: ) > Add 0.13.x to Spark 3 support matrix doc > > > Key: HUDI-6740 > URL: https://issues.apache.org/jira/browse/HUDI-6740 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Akira Ajisaka >Priority: Minor > Labels: pull-request-available > > Hudi 0.13.x is missing in the Spark 3 support matrix doc > [https://hudi.apache.org/docs/quick-start-guide/] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] aajisaka opened a new pull request, #9507: [HUDI-6740] Add 0.13.x to Spark 3 support matrix doc
aajisaka opened a new pull request, #9507: URL: https://github.com/apache/hudi/pull/9507

### Change Logs

Add Hudi 0.13.x to the Spark 3 support matrix doc https://hudi.apache.org/docs/quick-start-guide/

### Impact

Doc only change

### Risk level (write none, low medium or high below)

low

### Documentation Update

This is a documentation-only update

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 support matrix doc
[ https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated HUDI-6740: Description: Hudi 0.13.x is missing in the Spark 3 support matrix doc [https://hudi.apache.org/docs/quick-start-guide/] (was: Hudi 0.13.x is missing in the Spark 3 matrix support doc [https://hudi.apache.org/docs/quick-start-guide/]) Summary: Add 0.13.x to Spark 3 support matrix doc (was: Add 0.13.x to Spark 3 matrix support doc) > Add 0.13.x to Spark 3 support matrix doc > > > Key: HUDI-6740 > URL: https://issues.apache.org/jira/browse/HUDI-6740 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Akira Ajisaka >Priority: Minor > > Hudi 0.13.x is missing in the Spark 3 support matrix doc > [https://hudi.apache.org/docs/quick-start-guide/] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6740) Add 0.13.x to Spark 3 matrix support doc
Akira Ajisaka created HUDI-6740: --- Summary: Add 0.13.x to Spark 3 matrix support doc Key: HUDI-6740 URL: https://issues.apache.org/jira/browse/HUDI-6740 Project: Apache Hudi Issue Type: Improvement Reporter: Akira Ajisaka Hudi 0.13.x is missing in the Spark 3 matrix support doc [https://hudi.apache.org/docs/quick-start-guide/] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 matrix support doc
[ https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated HUDI-6740: Description: Hudi 0.13.x is missing in the Spark 3 matrix support doc [https://hudi.apache.org/docs/quick-start-guide/] (was: Hudi 0.13.x is missing in the Spark 3 matrix support doc [https://hudi.apache.org/docs/quick-start-guide/]) > Add 0.13.x to Spark 3 matrix support doc > > > Key: HUDI-6740 > URL: https://issues.apache.org/jira/browse/HUDI-6740 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Akira Ajisaka >Priority: Minor > > Hudi 0.13.x is missing in the Spark 3 matrix support doc > [https://hudi.apache.org/docs/quick-start-guide/] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1689818375 ## CI report: * bcbf6a91785f83ca29e61a923b51d6b429f22f01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19406) * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19422)
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1689807493 ## CI report: * bcbf6a91785f83ca29e61a923b51d6b429f22f01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19406) * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d UNKNOWN
[jira] [Created] (HUDI-6739) Avoid checking timeline for successful commits for spark structured streaming when offset is 0
Aditya Goenka created HUDI-6739: --- Summary: Avoid checking timeline for successful commits for spark structured streaming when offset is 0 Key: HUDI-6739 URL: https://issues.apache.org/jira/browse/HUDI-6739 Project: Apache Hudi Issue Type: Improvement Components: spark Reporter: Aditya Goenka Fix For: 1.0.0

For the Spark Structured Streaming writer, a new writer currently checks the active timeline to identify the last successful commit and ensure idempotency. There is no need to check the checkpoint when the offset is 0, since that means it is a new writer.
[GitHub] [hudi] JoshuaZhuCN commented on issue #9418: [SUPPORT] Hudi table does not support Spark SQL's cache table syntax
JoshuaZhuCN commented on issue #9418: URL: https://github.com/apache/hudi/issues/9418#issuecomment-1689701271 @ad1happy2go It's really strange. I tried it in my testing environment and there was no problem. The same code has problems in the production environment, so I'll investigate the differences in my environment first.
[GitHub] [hudi] lokeshj1703 commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0
lokeshj1703 commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1689684602 https://github.com/apache/hudi/assets/9255455/56fff9c4-70d3-4ce1-941d-e25a1e7539c2
[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
hudi-bot commented on PR #9501: URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689654573 ## CI report: * 37f1a8c2a46dcd7f7f549e2deffaaa85b65f58e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19407) * 2db1e3e6179a5b45172ea9abae0c869375fffc1b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19421)
[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
hudi-bot commented on PR #9501: URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689641469 ## CI report: * 37f1a8c2a46dcd7f7f549e2deffaaa85b65f58e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19407) * 2db1e3e6179a5b45172ea9abae0c869375fffc1b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1689626305 ## CI report: * d403d1696e76ee6daf6ad6ddc62e9a218a5accab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19417)
[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
harsh1231 commented on code in PR #9501: URL: https://github.com/apache/hudi/pull/9501#discussion_r1302716600

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java: ##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
Moved to the sources package. It is only for the snapshot load query of incremental sources, so the name is kept as it is. Added a unit test with a test implementation; there is no concrete implementation at this point.
[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
harsh1231 commented on code in PR #9501: URL: https://github.com/apache/hudi/pull/9501#discussion_r1302716600

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java: ##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
Moved to the sources package. The name should indicate that it is only for the snapshot load query of incremental sources, so it is kept as it is. Added a unit test with a test implementation; there is no concrete implementation at this point.
[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi
[ https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757870#comment-17757870 ] Lin Liu commented on HUDI-4756: --- Realized that too many unit tests need debugging, and the effort required for this issue is uncertain. Will therefore work on this issue in the background and prioritize other tasks for now. > Clean up usages of "assume.date.partition" config within hudi > - > > Key: HUDI-4756 > URL: https://issues.apache.org/jira/browse/HUDI-4756 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Looks like "assume.date.partition" is not used anywhere within hudi. Let's > clean up the usages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys
[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu closed HUDI-6701. - Resolution: Information Provided > Explore use of UUID-6/7 as a replacement for current auto generated keys > > > Key: HUDI-6701 > URL: https://issues.apache.org/jira/browse/HUDI-6701 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Today, we auto generate string keys of the form > (HoodieRecord#generateSequenceId), which is highly compressible, esp compared > to uuidv1, when we store as a string column inside a parquet file. > {code:java} > public static String generateSequenceId(String instantTime, int > partitionId, long recordIndex) { > return instantTime + "_" + partitionId + "_" + recordIndex; > } > {code} > As a part of this task, we'd love to understand if > - Can uuid6 or 7, provide similar compressed storage footprint when written > as a column in a parquet file. > - can the current format be represented as a 160-bit number i.e 2 longs, 1 > int in storage? would that save us further in storage costs? > (Orthogonal consideration is the memory needed to hold the key string, which > can be higher than a 160bits. We can discuss this later, once we understand > storage footprint) > > Resources: > * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ > * https://github.com/uuid6/uuid6-ietf-draft > * https://github.com/uuid6/prototypes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys
[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757867#comment-17757867 ] Lin Liu commented on HUDI-6701: --- Based on our discussion and experiments, we conclude that we will keep the current row key format since it reaches the sweet spot between time and storage tradeoffs. This task is closed for now.
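To make the 160-bit question in the HUDI-6701 description concrete, here is a minimal sketch of packing the three key components into 2 longs and 1 int (20 bytes). It is not Hudi code: the class `SequenceKeyCodec` and its `encode`/`decode` methods are hypothetical, and it assumes the instant time is a purely numeric string without leading zeros that fits in a `long`. Only `generateSequenceId` mirrors the method quoted in the issue.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of Hudi.
final class SequenceKeyCodec {
    // 8 bytes (instant) + 4 bytes (partition id) + 8 bytes (record index) = 20 bytes = 160 bits.
    static final int ENCODED_BYTES = Long.BYTES + Integer.BYTES + Long.BYTES;

    // Same string layout as the method quoted in the issue description.
    static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    // Packs "instant_partition_index" into a fixed 20-byte array.
    // Assumption: instantTime is numeric, has no leading zeros, and fits in a long.
    static byte[] encode(String sequenceId) {
        String[] parts = sequenceId.split("_");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Unexpected key format: " + sequenceId);
        }
        return ByteBuffer.allocate(ENCODED_BYTES)
            .putLong(Long.parseLong(parts[0]))
            .putInt(Integer.parseInt(parts[1]))
            .putLong(Long.parseLong(parts[2]))
            .array();
    }

    // Restores the string form from the 20-byte representation.
    static String decode(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long instant = buffer.getLong();
        int partitionId = buffer.getInt();
        long recordIndex = buffer.getLong();
        return generateSequenceId(Long.toString(instant), partitionId, recordIndex);
    }
}
```

The sketch only shows that the fixed-width form round-trips; it says nothing about how either form compresses inside a parquet column, which is the tradeoff the comment above refers to.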
[hudi] branch master updated (c45c3782d33 -> 518d8728b1f)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from c45c3782d33 [HUDI-6549] Add support for comma separated path format for spark.read.load (#9503)
add 518d8728b1f [HUDI-4115] Adding support for schema while loading spark dataset in S3/GCS source (#9502)

No new revisions were added by this update.

Summary of changes:
.../sources/GcsEventsHoodieIncrSource.java | 5 +++-
.../sources/S3EventsHoodieIncrSource.java | 5 +++-
.../sources/helpers/CloudDataFetcher.java | 6 ++--
.../helpers/CloudObjectsSelectorCommon.java | 17 ++-
.../sources/TestGcsEventsHoodieIncrSource.java | 34 +++---
.../sources/TestS3EventsHoodieIncrSource.java | 28 +-
.../helpers/TestCloudObjectsSelectorCommon.java | 17 +++
.../test/resources/schema/sample_data_schema.avsc | 16 ++
.../src/test/resources/schema/sample_gcs_data.avsc | 20 -
9 files changed, 111 insertions(+), 37 deletions(-)
copy hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java => hudi-utilities/src/test/resources/schema/sample_data_schema.avsc (81%)
copy hudi-common/src/test/resources/simple-test.avsc => hudi-utilities/src/test/resources/schema/sample_gcs_data.avsc (78%)
[GitHub] [hudi] codope merged pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source
codope merged PR #9502: URL: https://github.com/apache/hudi/pull/9502
[GitHub] [hudi] hudi-bot commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source
hudi-bot commented on PR #9502: URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689450574

## CI report:
* 63342240f5d36496ac7ac12565878ae661600ec2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19408)
* 0833c22802627298ba32024eed36d4114204a389 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19419)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] codope commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source
codope commented on PR #9502: URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689449336 Azure CI passed: https://github.com/apache/hudi/assets/16440354/fbf410cd-f650-430f-94f6-da110d578ceb GH actions were already green last time, and only a license header was added in the subsequent commit. Landing the PR.
[GitHub] [hudi] hudi-bot commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source
hudi-bot commented on PR #9502: URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689439250

## CI report:
* 63342240f5d36496ac7ac12565878ae661600ec2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19408)
* 0833c22802627298ba32024eed36d4114204a389 UNKNOWN
[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
harsh1231 commented on code in PR #9501: URL: https://github.com/apache/hudi/pull/9501#discussion_r1302582210

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
## @@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment: Yeah, this will fall back to the regular pipeline for null. For the does-not-exist case I think we should keep the current behavior (it could be a jar issue), i.e. fail the pipeline if the class is not found.
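The behavior discussed in this review thread on PR #9501 (an unset class name falls back to the regular path, an unloadable class fails the pipeline) can be sketched with plain JDK types. This is an illustration under stated assumptions, not the PR's code: it uses `java.util.Optional` and `Class.forName` in place of Hudi's `Option` and `ReflectionUtils.loadClass`, and the class `OptionalPluginLoader` and the property key `example.splitter.class` are hypothetical.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical illustration only; stands in for the Option/ReflectionUtils pattern in the diff.
final class OptionalPluginLoader {
    // Assumed property key for this sketch, not a real Hudi config name.
    static final String SPLITTER_CLASS_KEY = "example.splitter.class";

    // Returns an empty Optional when the key is unset (caller uses the regular path),
    // and throws when a class name is configured but cannot be instantiated.
    static Optional<Object> load(Properties props) {
        return Optional.ofNullable(props.getProperty(SPLITTER_CLASS_KEY))
            .map(className -> {
                try {
                    Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
                    return instance;
                } catch (ReflectiveOperationException e) {
                    // A configured but missing class is likely a packaging problem, so fail loudly.
                    throw new IllegalStateException("Could not load class " + className, e);
                }
            });
    }
}
```

A caller would branch on `isPresent()`: use the loaded object when it is there and the default code path otherwise, so a misconfigured class name is never silently ignored.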
[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
harsh1231 commented on code in PR #9501: URL: https://github.com/apache/hudi/pull/9501#discussion_r1302582210

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
## @@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment: Yeah, this will fall back to the regular pipeline for null. For the does-not-exist case I think we should keep the current behavior (it could be a jar issue).
[GitHub] [hudi] codope commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.
codope commented on code in PR #9501: URL: https://github.com/apache/hudi/pull/9501#discussion_r1302576522

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java:
## @@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
1. This is not a helper. Move the abstraction out of the `helpers` package.
2. How about keeping the base class name simply `IncrementalLoadSplitter`, as it is used only for the incremental source?
3. Why are we not providing any concrete out-of-the-box implementation?
4. Can we add a unit test for the splitter's `getNextCheckpoint`?

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
## @@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment: If it's null or does not exist, should we fall back to the regular path instead of breaking the pipeline?
[jira] [Closed] (HUDI-6549) Add support for comma separated read path format in CloudObjectsSelectorCommon
[ https://issues.apache.org/jira/browse/HUDI-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-6549. - Fix Version/s: 0.14.0 Resolution: Fixed > Add support for comma separated read path format in > CloudObjectsSelectorCommon > --- > > Key: HUDI-6549 > URL: https://issues.apache.org/jira/browse/HUDI-6549 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Harshal Patil >Assignee: Harshal Patil >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > > For file formats like XML, comma separated path formats are supported for > loading data from more than one file. > It does not support an array of strings in spark.read.load().
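As a rough illustration of the two path shapes contrasted in HUDI-6549, the sketch below turns a list of file paths into either one comma-joined string or one argument per path. It is plain Java and does not call Spark or Hudi; the class `ReadPathFormat`, the method `toLoadArguments`, and the `commaSeparated` flag are hypothetical names for this example, not the config added by the change.

```java
import java.util.List;

// Hypothetical illustration only; not the CloudObjectsSelectorCommon implementation.
final class ReadPathFormat {
    // Builds the path arguments a reader would receive:
    // a single comma-joined string, or one entry per path.
    static String[] toLoadArguments(List<String> paths, boolean commaSeparated) {
        if (commaSeparated) {
            return new String[] {String.join(",", paths)};
        }
        return paths.toArray(new String[0]);
    }
}
```

With `commaSeparated` set, two input paths yield a one-element array whose single entry contains both paths joined by a comma; otherwise the array has one element per path.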
[GitHub] [hudi] codope merged pull request #9503: [HUDI-6549] Adding support for comma separated path format for spark.read.load
codope merged PR #9503: URL: https://github.com/apache/hudi/pull/9503
[hudi] branch master updated (cbd6f91991f -> c45c3782d33)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from cbd6f91991f [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted (#9444)
add c45c3782d33 [HUDI-6549] Add support for comma separated path format for spark.read.load (#9503)

No new revisions were added by this update.

Summary of changes:
.../sources/helpers/CloudObjectsSelectorCommon.java | 11 ++-
.../utilities/sources/helpers/CloudStoreIngestionConfig.java | 12
.../sources/helpers/TestCloudObjectsSelectorCommon.java | 1 +
3 files changed, 23 insertions(+), 1 deletion(-)