date:20230823

[GitHub] [hudi] imrewang commented on issue #9513: [SUPPORT]Index Bootstrap deleted snapshot data that has been batch-inserted into Hudi ?

2023-08-23 Thread via GitHub



imrewang commented on issue #9513:
URL: https://github.com/apache/hudi/issues/9513#issuecomment-1691036778

   I want to delete some of the rows, but not all of them.
   
   The goal is simply to keep the data updated dynamically.
   
   For example, if the `incremental data` contains a deletion of a row that 
belongs to the `snapshot data`, the corresponding row in the `snapshot data` 
that is already stored in Hudi must be deleted as well.
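
   The requirement above can be pictured with a minimal, self-contained sketch. It is not Hudi code: `SnapshotDeleteSketch`, `Change` and `apply` are hypothetical names, and the table is modelled as an in-memory map keyed by record key, purely to show a delete in the incremental stream removing a row that arrived with the snapshot load.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a real pipeline would write through Hudi.
final class SnapshotDeleteSketch {

  enum Op { UPSERT, DELETE }

  // One incremental change event, identified by its record key.
  static final class Change {
    final Op op;
    final String key;
    final String value;

    Change(Op op, String key, String value) {
      this.op = op;
      this.key = key;
      this.value = value;
    }
  }

  // Applies incremental changes on top of a copy of the snapshot rows.
  static Map<String, String> apply(Map<String, String> snapshot, List<Change> incremental) {
    Map<String, String> result = new LinkedHashMap<>(snapshot);
    for (Change change : incremental) {
      if (change.op == Op.DELETE) {
        result.remove(change.key); // the deletion removes the snapshot row
      } else {
        result.put(change.key, change.value);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> snapshot = new LinkedHashMap<>();
    snapshot.put("id-1", "row from snapshot");
    snapshot.put("id-2", "another snapshot row");
    List<Change> incremental = List.of(new Change(Op.DELETE, "id-1", null));
    System.out.println(apply(snapshot, incremental).keySet());
  }
}
```

   The sketch only works because the snapshot rows and the incremental events share the same record key; that is an assumption here.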


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] codope commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



codope commented on PR #9501:
URL: https://github.com/apache/hudi/pull/9501#issuecomment-1691022209

   @harsh1231 `ITTestHoodieSyncCommand#testValidateSync` failed in GH Actions. 
Can you please check? I've retriggered the Spark 3.0 tests.



[jira] [Created] (HUDI-6743) Use current instant time as part of the log file name

2023-08-23 Thread Danny Chen (Jira)

Danny Chen created HUDI-6743:


 Summary: Use current instant time as part of the log file name
 Key: HUDI-6743
 URL: https://issues.apache.org/jira/browse/HUDI-6743
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Danny Chen
Assignee: Danny Chen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (HUDI-6742) Remove the log file appending for multiple instants

2023-08-23 Thread Danny Chen (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-6742:


Assignee: Danny Chen

> Remove the log file appending for multiple instants
> ---
>
> Key: HUDI-6742
> URL: https://issues.apache.org/jira/browse/HUDI-6742
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>





[jira] [Created] (HUDI-6742) Remove the log file appending for multiple instants

2023-08-23 Thread Danny Chen (Jira)

Danny Chen created HUDI-6742:


 Summary: Remove the log file appending for multiple instants
 Key: HUDI-6742
 URL: https://issues.apache.org/jira/browse/HUDI-6742
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Danny Chen
 Fix For: 1.0.0







[jira] [Updated] (HUDI-6742) Remove the log file appending for multiple instants

2023-08-23 Thread Danny Chen (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6742:
-
Epic Link: HUDI-6640
Reviewers: Vinoth Chandar

> Remove the log file appending for multiple instants
> ---
>
> Key: HUDI-6742
> URL: https://issues.apache.org/jira/browse/HUDI-6742
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>





[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690992353

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   



[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690986681

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
   
   
   



[GitHub] [hudi] danny0405 commented on issue #9513: [SUPPORT]Index Bootstrap deleted snapshot data that has been batch-inserted into Hudi ?

2023-08-23 Thread via GitHub



danny0405 commented on issue #9513:
URL: https://github.com/apache/hudi/issues/9513#issuecomment-1690982914

   Why do you want to delete the snapshot data? And if you do, why import it in the first place?



[GitHub] [hudi] TengHuo commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-23 Thread via GitHub



TengHuo commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1303746084


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstantsAsStream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstantsAsStream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);

Review Comment:
   I think this issue is caused by the writers being isolated from each other. 
There is no lock on the Hudi timeline, so while one writer is running 
`validateReplaceCommit`, it is impossible to guarantee that no other writer is 
modifying the same timeline.
   
   As @boneanxs mentioned, there could be other situations, such as two `insert 
overwrite` operations running together, or `compaction` and `clustering` 
generating plans at the same time. The current timeline can't provide enough 
information (which partitions were modified or dropped) to each writer to 
prevent these concurrency issues.
   
   In Hudi 
[RFC-56](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md), OCC 
was introduced to solve write conflicts at the file group level, but there is 
no conflict check at the partition level. So I think this partition-level 
conflict should be solved in the OCC framework; that is better because all 
similar situations are then covered as long as OCC is enabled.
   
   cc @voonhous 
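
   A minimal sketch of what a partition-level conflict check could look like. It is not part of Hudi's OCC framework: `PartitionConflictSketch` and `hasPartitionConflict` are hypothetical names, and the partitions touched by each operation are passed in as plain sets instead of being read from commit metadata.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not Hudi's conflict resolution strategy.
final class PartitionConflictSketch {

  // Returns true when the two operations touched at least one common partition.
  static boolean hasPartitionConflict(Set<String> partitionsOfWriterA, Set<String> partitionsOfWriterB) {
    Set<String> overlap = new HashSet<>(partitionsOfWriterA);
    overlap.retainAll(partitionsOfWriterB);
    return !overlap.isEmpty();
  }

  public static void main(String[] args) {
    // Assumed example: clustering rewrites two partitions while another
    // writer drops one of them.
    Set<String> clustering = Set.of("2023/08/22", "2023/08/23");
    Set<String> deletePartition = Set.of("2023/08/23");
    System.out.println(hasPartitionConflict(clustering, deletePartition));
  }
}
```

   A check of this shape is coarser than a file-group check: it would reject two writers that touch different file groups in the same partition, which may or may not be acceptable.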




[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-23 Thread via GitHub



majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1303744014


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";

Review Comment:
   I don't think this situation can happen, because my validation runs after 
the inflight commit has been written to disk. Suppose two inflight replace 
commits happen at the same time:
   1. Simultaneous disk write: even if both disk writes happen at the same 
time, validation reloads the timeline first, so it will definitely see the 
file that the other party has written to disk.
   2. Simultaneous validation: both parties have already written to disk, so 
both validations will find each other and report errors.
   3. One party validates while the other writes to disk: the validating party 
has already completed its disk write, so the party that is still writing will 
definitely find it.
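
   The three cases above can be modelled with a small sketch. It is not the code in this PR: `WriteThenValidateSketch` is a hypothetical class, the timeline is an in-memory map, and the sketch assumes that a write is visible to every later reload, which is the assumption the argument rests on.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "write the inflight commit first, then reload and validate".
final class WriteThenValidateSketch {

  // instant time -> file IDs replaced by that inflight replace commit
  private final Map<String, Set<String>> timeline = new ConcurrentHashMap<>();

  // Step 1: publish the inflight commit so that other writers can see it.
  void writeInflight(String instant, Set<String> fileIds) {
    timeline.put(instant, Set.copyOf(fileIds));
  }

  // Step 2: re-read the timeline; fail if another instant replaces one of our file IDs.
  boolean validate(String instant) {
    Set<String> mine = timeline.get(instant);
    if (mine == null) {
      throw new IllegalStateException("inflight commit must be written before validation: " + instant);
    }
    for (Map.Entry<String, Set<String>> other : timeline.entrySet()) {
      if (!other.getKey().equals(instant) && other.getValue().stream().anyMatch(mine::contains)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    WriteThenValidateSketch timeline = new WriteThenValidateSketch();
    timeline.writeInflight("001", Set.of("fg-1", "fg-2"));
    timeline.writeInflight("002", Set.of("fg-2"));
    // Both writers have written before validating, so both see the overlap on fg-2.
    System.out.println(timeline.validate("001"));
    System.out.println(timeline.validate("002"));
  }
}
```

   If the underlying storage does not make a new inflight file visible to an immediate reload, the guarantee no longer holds, which is the scenario the reviewers ask about.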




[GitHub] [hudi] LinMingQiang commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC

2023-08-23 Thread via GitHub



LinMingQiang commented on PR #7469:
URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690920181

   I get what you mean, but using the heartbeat mechanism might be too 
heavy. Maybe we can add a ticket and ask the writer who obtains the ticket to 
perform the rollback. It's like a lock, but it doesn't affect other writers.
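
   A minimal sketch of the ticket idea, assuming a single JVM. `RollbackTicketSketch`, `tryAcquire` and `release` are hypothetical names; with several writer processes the ticket would have to live in shared storage or a lock provider rather than in an in-memory map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; not an existing Hudi class.
final class RollbackTicketSketch {

  // failed instant -> id of the writer that holds the rollback ticket
  private final ConcurrentMap<String, String> tickets = new ConcurrentHashMap<>();

  // Non-blocking: the first writer to ask for an instant gets the ticket,
  // every other writer gets false and simply skips the rollback.
  boolean tryAcquire(String failedInstant, String writerId) {
    String holder = tickets.putIfAbsent(failedInstant, writerId);
    return holder == null || holder.equals(writerId);
  }

  // Only the holder can give the ticket back.
  void release(String failedInstant, String writerId) {
    tickets.remove(failedInstant, writerId);
  }

  public static void main(String[] args) {
    RollbackTicketSketch tickets = new RollbackTicketSketch();
    System.out.println(tickets.tryAcquire("001", "writer-1"));
    System.out.println(tickets.tryAcquire("001", "writer-2"));
  }
}
```

   Unlike a lock, a writer that fails to get the ticket never waits; it carries on with ingestion and leaves the rollback to the ticket holder.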



[GitHub] [hudi] LinMingQiang opened a new pull request, #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC

2023-08-23 Thread via GitHub



LinMingQiang opened a new pull request, #7469:
URL: https://github.com/apache/hudi/pull/7469

   ### Change Logs
   
   `getInstantsToRollback` should be locked too.
   
   ### Impact
   
   OCC mode.
   
   ### Risk level (write none, low medium or high below)
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   



[GitHub] [hudi] LzMingYueShanPao commented on issue #143: Tracking ticket for folks to be added to slack group

2023-08-23 Thread via GitHub



LzMingYueShanPao commented on issue #143:
URL: https://github.com/apache/hudi/issues/143#issuecomment-1690915753

   Can you add me to the slack? dingding1035807...@dingtalk.com



[GitHub] [hudi] LinMingQiang commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC

2023-08-23 Thread via GitHub



LinMingQiang commented on PR #7469:
URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690915756

   > if I am not wrong, this is the core problem we are trying to solve. If 
there are failed commits and two concurrent writers try to roll back 
concurrently, we don't have a lock as such.
   > 
   > These complications arise just because Hudi tries to automatically clean up 
failed writes. In other similar systems, you may have to trigger explicit 
commands to clean up partially failed commits, or coordinate when multiple 
writers are involved.
   > 
   > Wanted to call it out. Anyway, coming back to the original issue: it's 
recommended to disable table services (like cleaner, archival) in all writers 
except one, so we won't end up in such conflicts. These are not latency 
sensitive anyway. And with this approach all other writes will be even faster, since 
they don't trigger any of these table services and only take care of ingestion.
   > 
   > We do have a table-level config to disable all table services: 
https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled
   > 
   > Having said all this, here is how I feel we could fix this issue.
   > 
   > We can leverage the heartbeats, such that rollback commits also start to 
emit heartbeats. So a concurrent writer knows whether some other writer is 
concurrently executing the rollback, or whether it's in a failed state. That way, 
only one writer will go ahead and execute the rollback while the others step 
away.
   > 
   > I remember @suryaprasanna wanted to fix this if I am not wrong.
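
   The heartbeat proposal quoted above can be sketched as follows. This is not Hudi's heartbeat client: `RollbackHeartbeatSketch`, `beat` and `isRollbackAlive` are hypothetical names, the timeout is an assumed parameter, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not an existing Hudi class.
final class RollbackHeartbeatSketch {

  private final Duration timeout;
  // rollback instant -> time of the last heartbeat emitted by the writer executing it
  private final Map<String, Instant> lastBeat = new ConcurrentHashMap<>();

  RollbackHeartbeatSketch(Duration timeout) {
    this.timeout = timeout;
  }

  // Called periodically by the writer that is executing the rollback.
  void beat(String rollbackInstant, Instant now) {
    lastBeat.put(rollbackInstant, now);
  }

  // true: another writer is still working on the rollback, so step away.
  // false: no heartbeat, or it expired, so the rollback looks failed and can be taken over.
  boolean isRollbackAlive(String rollbackInstant, Instant now) {
    Instant last = lastBeat.get(rollbackInstant);
    return last != null && Duration.between(last, now).compareTo(timeout) <= 0;
  }

  public static void main(String[] args) {
    RollbackHeartbeatSketch heartbeats = new RollbackHeartbeatSketch(Duration.ofSeconds(60));
    Instant start = Instant.parse("2023-08-23T00:00:00Z");
    heartbeats.beat("rollback-001", start);
    System.out.println(heartbeats.isRollbackAlive("rollback-001", start.plusSeconds(30)));
    System.out.println(heartbeats.isRollbackAlive("rollback-001", start.plusSeconds(120)));
  }
}
```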
   
   



[GitHub] [hudi] LinMingQiang closed pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC

2023-08-23 Thread via GitHub



LinMingQiang closed pull request #7469: [HUDI-5386] Cleaning conflicts when 
write concurrency mode is OCC
URL: https://github.com/apache/hudi/pull/7469



[GitHub] [hudi] boneanxs commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-23 Thread via GitHub



boneanxs commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1303729529


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";

Review Comment:
   How do you validate it if such an issue happens? For example, two inflight 
commits are written at nearly the same time and neither sees the other?




[GitHub] [hudi] nsivabalan commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



nsivabalan commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1303719719


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieSparkClientTestBase;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.RawTripTestPayload.recordToString;
+import static org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS;
+import static org.apache.spark.sql.SaveMode.Append;
+import static org.apache.spark.sql.SaveMode.Overwrite;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+/**
+ * Test mor with colstats enabled in scenarios to ensure that files
+ * are being appropriately read or not read.
+ * The strategy employed is to corrupt targeted base files. If we want
+ * to prove the file is read, we assert that an exception will be thrown.
+ * If we want to prove the file is not read, we expect the read to
+ * successfully execute.
+ */
+public class TestMORColstats extends HoodieSparkClientTestBase {
+
+  private static String matchCond = "trip_type = 'UBERX'";
+  private static String nonMatchCond = "trip_type = 'BLACK'";
+  private static String[] dropColumns = {"_hoodie_commit_time", "_hoodie_commit_seqno",
+      "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};
+
+  private Boolean shouldOverwrite;
+  Map options;
+  @TempDir
+  public java.nio.file.Path basePath;
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initSparkContexts();
+dataGen = new HoodieTestDataGenerator();
+shouldOverwrite = true;
+options = getOptions();
+Properties props = new Properties();
+props.putAll(options);
+try {
+      metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath.toString(), props);
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+cleanupSparkContexts();
+cleanupTestDataGenerator();
+metaClient = null;
+  }
+
+  /**
+   * Create two files, one should be excluded by colstats
+   */
+  @Test
+  public void testBaseFileOnly() {
+Dataset inserts = makeInsertDf("000", 100);
+Dataset batch1 = inserts.where(matchCond);
+Dataset batch2 = inserts.where(nonMatchCond);
+doWrite(batch1);
+doWrite(batch2);
+List filesToCorrupt = getFilesToCorrupt();
+assertEquals(1, filesToCorrupt.size());
+filesToCorrupt.forEach(TestMORColstats::corruptFile);
+assertEquals(0, readMatchingRecords().except(batch1).count());
+//Read without data skipping to show that it will fail
+//Reading with data skipping succeeded so that means that data skipping is 
working and the corrupted
+//file

[GitHub] [hudi] nsivabalan merged pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



nsivabalan merged PR #9468:
URL: https://github.com/apache/hudi/pull/9468



[hudi] branch master updated: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario (#9468)

2023-08-23 Thread sivabalan

This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d7c16f56f4f [HUDI-6718] Check Timeline Before Transitioning Inflight 
Clean in Multiwriter Scenario (#9468)
d7c16f56f4f is described below

commit d7c16f56f4f9dfa3a160dac459ae11944f922ec8
Author: Jon Vexler 
AuthorDate: Wed Aug 23 22:30:41 2023 -0400

[HUDI-6718] Check Timeline Before Transitioning Inflight Clean in 
Multiwriter Scenario (#9468)

- If two cleans start at nearly the same time, they will both attempt to 
execute the same clean instances. This does not cause any data corruption, but 
will cause a writer to fail when they attempt to create the commit in the 
timeline. This is because the commit will have already been written by the 
first writer. Now, we check the timeline before transitioning state.

Co-authored-by: Jonathan Vexler <=>
---
 .../hudi/table/action/clean/CleanActionExecutor.java   | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
index 05e1056324a..c931e7bce9d 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
@@ -261,8 +261,10 @@ public class CleanActionExecutor extends 
BaseActionExecutor extends 
BaseActionExecutor 0 ? 
cleanMetadataList.get(cleanMetadataList.size() - 1) : null;
   }
+
+  private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, HoodieIOException e) {
+    table.getMetaClient().reloadActiveTimeline();
+    if (table.getCleanTimeline().filterCompletedInstants().containsInstant(hoodieInstant.getTimestamp())) {
+      LOG.warn("Clean operation was completed by another writer for instant: " + hoodieInstant);
+    } else {
+      LOG.error("Failed to perform previous clean operation, instant: " + hoodieInstant, e);
+      throw e;
+    }
+  }
 }

[GitHub] [hudi] nsivabalan commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



nsivabalan commented on PR #9468:
URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690896492

   https://github.com/apache/hudi/assets/513218/645da90d-5ae4-4d4f-8c0a-7b6cc49e0c0a
   



[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690872044

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
 
   
   
   



[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690817940

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
 
   
   
   



[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690817806

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * 976734054ed50853f16196cd68f8a3c2a5976728 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19430)
 
   
   
   



[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails

2023-08-23 Thread via GitHub



Riddle4045 commented on issue #9505:
URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690797389

   closing



[GitHub] [hudi] Riddle4045 closed issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails

2023-08-23 Thread via GitHub



Riddle4045 closed issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI 
fails 
URL: https://github.com/apache/hudi/issues/9505



[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails

2023-08-23 Thread via GitHub



Riddle4045 commented on issue #9505:
URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690797249

   This worked with Spark 3.4; 3.1 and 3.3 failed.



[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690787290

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
   
   

[GitHub] [hudi] Jason-liujc opened a new issue, #9512: [SUPPORT] No table level lock when using DynamoDB lock provider

2023-08-23 Thread via GitHub

Jason-liujc opened a new issue, #9512:
URL: https://github.com/apache/hudi/issues/9512

**Describe the problem you faced**

We have a use case where we need a table-level lock for multiple EMR clusters
writing to the same Hudi table. We've tried the two options provided by the Hudi
OCC guide page.

For the `single_writer` option, we don't see any lock being created in DynamoDB,
so there were no write locks, and the EMR Hudi upsert job fails because
multiple writers are writing to the same table.

For `optimistic_concurrency_control`, it creates and deletes many locks
in DynamoDB, and the lock is not at the table level. So when two jobs write
to the same file, one of the jobs will fail. This forces us to add a
lot of retries in our cluster, which is not ideal.

Is there any

**To Reproduce**

Steps to reproduce the behavior:

1. Run Hudi upsert jobs in multiple AWS EMR clusters writing to the same
table at the same time.
2. Use DynamoDB as the lock provider; these jobs write to the same
partitions.
3. Use `single_writer` + `hoodie.cleaner.policy.failed.writes=EAGER` or
`optimistic_concurrency_control` + `hoodie.cleaner.policy.failed.writes=LAZY`
as part of the Hudi write options.
4. Each time, the jobs fail with different errors.
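
For reference, the second combination in step 3 could be assembled roughly as below. This is only a sketch: the lock table name, partition key, and region passed in `main` are placeholder values, and the option keys are the ones I understand Hudi's DynamoDB lock provider to read, so they should be checked against the configuration reference for the Hudi version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OccDynamoDbLockOptions {
    // Builds the multi-writer options from step 3 (OCC + LAZY failed-write cleaning).
    // The three arguments are placeholders supplied by the caller.
    public static Map<String, String> build(String lockTable, String partitionKey, String region) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
        opts.put("hoodie.write.lock.provider",
                 "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider");
        opts.put("hoodie.write.lock.dynamodb.table", lockTable);
        opts.put("hoodie.write.lock.dynamodb.partition_key", partitionKey);
        opts.put("hoodie.write.lock.dynamodb.region", region);
        return opts;
    }

    public static void main(String[] args) {
        // Print the options as key=value pairs, e.g. to paste into a job config.
        build("hudi-locks", "my_table", "us-east-1")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Every writer targeting the same table would need to pass the same lock table and partition key; otherwise the writers would not contend on the same lock item.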

**Expected behavior**

Hudi would provide a table-level lock that allows us to block other writers
from writing until one of the writers is done when writing to the same partition.

**Environment Description**

* Hudi version : 0.13.0 (EMR 6.11)

* Spark version : 3.3.1

* Hive version : 3.1.3

* Hadoop version : 3.3.3

* Storage (HDFS/S3/GCS..) : S3

* Running on Docker? (yes/no) : No

**Additional context**

We are considering building something of our own to enforce sequential
insertion during orchestration. But this can be avoided if Hudi provides a
table-level lock.

**Stacktrace**

Single writer:

```

23/08/21 23:19:47 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3://bucket/xxxkey commits 20230821231625673
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:823)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:727)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:711)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706)
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:843)
```

Multi writer with `optimistic_concurrency_control`:

```

23/08/21 21:40:31 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
    at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:108)
    at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
    at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
```


[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690681949

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9510:
URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690681912

   
   ## CI report:
   
   * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19428)
 
   
   

[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



the-other-tim-brown commented on code in PR #9511:
URL: https://github.com/apache/hudi/pull/9511#discussion_r1303570089


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java:
##
@@ -220,16 +220,15 @@ public static FileSystemViewManager createViewManager(final HoodieEngineContext
                                                         final HoodieMetadataConfig metadataConfig,
                                                         final FileSystemViewStorageConfig config,
                                                         final HoodieCommonConfig commonConfig) {
-    return createViewManager(context, metadataConfig, config, commonConfig, (SerializableSupplier) null);
+    return createViewManager(context, metadataConfig, config, commonConfig, null);
   }
 
-  public static FileSystemViewManager createViewManager(final HoodieEngineContext context,
-                                                        final HoodieMetadataConfig metadataConfig,
-                                                        final FileSystemViewStorageConfig config,
-                                                        final HoodieCommonConfig commonConfig,
-                                                        final String basePath) {
+  public static FileSystemViewManager createViewManagerWithTableMetadata(final HoodieEngineContext context,
+                                                                         final HoodieMetadataConfig metadataConfig,
+                                                                         final FileSystemViewStorageConfig config,
+                                                                         final HoodieCommonConfig commonConfig) {
     return createViewManager(context, metadataConfig, config, commonConfig,

Review Comment:
   I think we would need a way to get this config at runtime but I don't see a 
way to do that with the current structure




[GitHub] [hudi] nsivabalan commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC

2023-08-23 Thread via GitHub



nsivabalan commented on PR #7469:
URL: https://github.com/apache/hudi/pull/7469#issuecomment-1690616808

   If I am not wrong, this is the core problem we are trying to solve: 
   if there are failed commits and two concurrent writers try to roll back 
concurrently, we don't have a lock as such. 
   
   These complications arise just because Hudi tries to automatically clean up 
failed writes. In other similar systems, you may have to trigger explicit 
commands to clean up partially failed commits, or coordinate when multiple 
writers are involved. 
   
   Wanted to call it out. 
   Anyway, coming back to the original issue: it's recommended to disable table 
services (like cleaner, archival) in all writers except one, so we won't end up 
in such conflicts. These are not latency sensitive anyway. And with this 
approach all other writes will be even faster, since they don't trigger any 
of these table services and only take care of ingestion. 
   
   We do have a table level config to disable all table services 
   https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled 
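
   As a rough illustration of that recommendation (a sketch only; the writer IDs and the helper class are made up for the example, while `hoodie.table.services.enabled` is the config linked above), each writer's properties could be derived like this:

```java
import java.util.Properties;

public class TableServiceAssignment {
    static final String TABLE_SERVICES_ENABLED = "hoodie.table.services.enabled";

    // Returns properties for one writer: only the designated owner keeps
    // table services (cleaner, archival, ...) enabled; all others only ingest.
    public static Properties forWriter(String writerId, String tableServiceOwner) {
        Properties props = new Properties();
        boolean isOwner = writerId.equals(tableServiceOwner);
        props.setProperty(TABLE_SERVICES_ENABLED, Boolean.toString(isOwner));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical writer IDs; "writer-a" is chosen as the table-service owner.
        for (String w : new String[] {"writer-a", "writer-b", "writer-c"}) {
            System.out.println(w + " -> " + forWriter(w, "writer-a"));
        }
    }
}
```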
   
   Having said all this, here is how I feel we could fix this issue. 
   
   We can leverage the heartbeats, such that rollback commits also start to 
emit heartbeats. So, a concurrent writer knows whether some other writer is 
concurrently executing the rollback, or whether it's in a failed state. That way, 
only one writer will go ahead and execute the rollback while the others will step 
away. 
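
   To make the idea concrete, here is a minimal in-memory sketch. It is not Hudi's heartbeat API: the class, method names, and timeout are hypothetical, and a real version would have to keep heartbeats somewhere all writers can see, such as table storage, rather than in one process.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RollbackHeartbeatSketch {
    // Failed instant time -> last heartbeat (ms) from the writer rolling it back.
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public RollbackHeartbeatSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Returns true if the caller should execute the rollback: either nobody has
    // claimed it yet, or the previous claimant's heartbeat has expired.
    public boolean tryClaimRollback(String instant, long nowMs) {
        boolean[] claimed = {false};
        heartbeats.compute(instant, (key, last) -> {
            if (last == null || nowMs - last > timeoutMs) {
                claimed[0] = true;
                return nowMs; // the claim itself counts as the first heartbeat
            }
            return last; // someone else is alive and rolling back; step away
        });
        return claimed[0];
    }

    // Called periodically by the writer that owns the rollback.
    public void heartbeat(String instant, long nowMs) {
        heartbeats.computeIfPresent(instant, (key, last) -> nowMs);
    }
}
```

   A writer that gets `false` skips the rollback and carries on; if the owner dies, its heartbeat expires and the next writer to call `tryClaimRollback` takes over.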
   
   I remember @suryaprasanna wanted to fix this if I am not wrong.
   
   
   



[GitHub] [hudi] Riddle4045 commented on issue #9505: [SUPPORT] Scheduling Compaction using Hudi CLI fails

2023-08-23 Thread via GitHub



Riddle4045 commented on issue #9505:
URL: https://github.com/apache/hudi/issues/9505#issuecomment-1690607980

   @danny0405 is there any doc on exactly which versions of Spark work with 
`hudi-cli/hudi-cli.sh`? I have tried both 2.x and 3.x, and both have different 
dependency conflicts.



[GitHub] [hudi] yihua commented on a diff in pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



yihua commented on code in PR #9511:
URL: https://github.com/apache/hudi/pull/9511#discussion_r1303505252


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java:
##
@@ -220,16 +220,15 @@ public static FileSystemViewManager createViewManager(final HoodieEngineContext
                                                         final HoodieMetadataConfig metadataConfig,
                                                         final FileSystemViewStorageConfig config,
                                                         final HoodieCommonConfig commonConfig) {
-    return createViewManager(context, metadataConfig, config, commonConfig, (SerializableSupplier) null);
+    return createViewManager(context, metadataConfig, config, commonConfig, null);
   }
 
-  public static FileSystemViewManager createViewManager(final HoodieEngineContext context,
-                                                        final HoodieMetadataConfig metadataConfig,
-                                                        final FileSystemViewStorageConfig config,
-                                                        final HoodieCommonConfig commonConfig,
-                                                        final String basePath) {
+  public static FileSystemViewManager createViewManagerWithTableMetadata(final HoodieEngineContext context,
+                                                                         final HoodieMetadataConfig metadataConfig,
+                                                                         final FileSystemViewStorageConfig config,
+                                                                         final HoodieCommonConfig commonConfig) {
     return createViewManager(context, metadataConfig, config, commonConfig,

Review Comment:
   Not sure if we need to differentiate the `metadataConfig` across tables as 
well.  For now, it's fine.




[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690550106

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
 
   * 976734054ed50853f16196cd68f8a3c2a5976728 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19430)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690540693

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
 
   * 976734054ed50853f16196cd68f8a3c2a5976728 UNKNOWN
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9482:
URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690448816

   
   ## CI report:
   
   * 39166302aadd51524e017f92a883e960e07a37a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19426)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9468:
URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690448684

   
   ## CI report:
   
   * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN
   * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19427)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690448536

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
 
   
   

[GitHub] [hudi] CTTY commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted

2023-08-23 Thread via GitHub



CTTY commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1690447794

   Hi @nsivabalan, I checked the code and it looks like expected behavior. 
The non-Java 17 test also fails and repeats 204 times. If we want it to 
repeat fewer times, maybe we can look into reducing the `retryLoop` number at this 
[line](https://github.com/apache/hudi/blob/87a187a91d472f16d569b6388098d27a3aede760/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtilsWithRetryWrapperEnable.java#L74)



[GitHub] [hudi] twlo-sandeep commented on pull request #9475: [MINOR] fixing mysql debezium data loss

2023-08-23 Thread via GitHub



twlo-sandeep commented on PR #9475:
URL: https://github.com/apache/hudi/pull/9475#issuecomment-1690398458

   @rmahindra123 - can you please take a look at this PR? 



[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690374678

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19429)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9509:
URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690374607

   
   ## CI report:
   
   * 96700cfb604e637219f231d28a84dfcf033112b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19424)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9511:
URL: https://github.com/apache/hudi/pull/9511#issuecomment-1690312328

   
   ## CI report:
   
   * 350a0a363d2ebaa995f99edf2bebd88fe5056a27 UNKNOWN
   
   

[jira] [Updated] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled

2023-08-23 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6741:
-
Labels: pull-request-available  (was: )

> Timeline server cannot handle multiple base paths when metadata table is 
> enabled
> 
>
> Key: HUDI-6741
> URL: https://issues.apache.org/jira/browse/HUDI-6741
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> The Timeline Server will take in a view manager to gather the information 
> about the tables. When the metadata table is enabled, there is a supplier 
> that will be called to get the 
> HoodieTableMetadata. That supplier is configured for a single base path but 
> the timeline server can be used for multiple tables. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [hudi] the-other-tim-brown opened a new pull request, #9511: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled

2023-08-23 Thread via GitHub



the-other-tim-brown opened a new pull request, #9511:
URL: https://github.com/apache/hudi/pull/9511

   ### Change Logs
   
   Updates the logic for creating the `HoodieTableMetadata` in the 
`FileSystemViewManager` to be a function of the `HoodieTableMetaClient` instead 
of simply a supplier. This allows us to use the current basePath instead of 
using the base path that is available when creating the `FileSystemViewManager` 
which enables a single view manager to work properly with multiple tables when 
metadata table is enabled for those tables.
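
   The difference between the two shapes can be sketched in isolation. The `MetaClient` and `TableMetadata` classes below are simplified, hypothetical stand-ins for `HoodieTableMetaClient` and `HoodieTableMetadata`, not the real Hudi types, and the paths are invented for the example:

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class MetadataFactorySketch {
    // Hypothetical stand-in for HoodieTableMetaClient: knows its table's base path.
    public static final class MetaClient {
        public final String basePath;
        public MetaClient(String basePath) { this.basePath = basePath; }
    }

    // Hypothetical stand-in for HoodieTableMetadata: remembers which table it reads.
    public static final class TableMetadata {
        public final String basePath;
        public TableMetadata(String basePath) { this.basePath = basePath; }
    }

    // Old shape: the base path is captured once, when the view manager is created.
    public static Supplier<TableMetadata> fixedSupplier(String basePathAtCreation) {
        return () -> new TableMetadata(basePathAtCreation);
    }

    // New shape: the base path comes from whichever meta client is being served.
    public static Function<MetaClient, TableMetadata> perTableFactory() {
        return metaClient -> new TableMetadata(metaClient.basePath);
    }

    public static void main(String[] args) {
        MetaClient tableB = new MetaClient("/tables/b");
        // A supplier created for table A keeps answering with table A's path.
        System.out.println(fixedSupplier("/tables/a").get().basePath);
        // The function resolves the path from the meta client passed to it.
        System.out.println(perTableFactory().apply(tableB).basePath);
    }
}
```

   With the supplier, a request for table B is answered with metadata for table A; with the function, each request resolves metadata for its own table.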
   
   ### Impact
   
   Fixes bug impacting use of a single timeline server for multiple tables with 
metadata table enabled for those tables.
   
   ### Risk level (write none, low medium or high below)
   
   low, small risk of changing the signature of a public method for those that 
may be calling it outside of the Hudi project. Otherwise, all changes are 
hidden from the user.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Assigned] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled

2023-08-23 Thread Timothy Brown (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown reassigned HUDI-6741:
---

Assignee: Timothy Brown

> Timeline server cannot handle multiple base paths when metadata table is 
> enabled
> 
>
> Key: HUDI-6741
> URL: https://issues.apache.org/jira/browse/HUDI-6741
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>
> The Timeline Server will take in a view manager to gather the information 
> about the tables. When the metadata table is enabled, there is a supplier 
> that will be called to get the 
> HoodieTableMetadata. That supplier is configured for a single base path but 
> the timeline server can be used for multiple tables. 




[jira] [Created] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled

2023-08-23 Thread Timothy Brown (Jira)

Timothy Brown created HUDI-6741:
---

 Summary: Timeline server cannot handle multiple base paths when 
metadata table is enabled
 Key: HUDI-6741
 URL: https://issues.apache.org/jira/browse/HUDI-6741
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Timothy Brown


The Timeline Server will take in a view manager to gather the information about 
the tables. When the metadata table is enabled, there is a supplier that will 
be called to get the 
HoodieTableMetadata. That supplier is configured for a single base path but the 
timeline server can be used for multiple tables. 




[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9468:
URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690218261

   
   ## CI report:
   
   * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360)
 
   * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN
   * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19427)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9510:
URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690206192

   
   ## CI report:
   
   * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19428)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690205649

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * 37e7683cbab44901aaeab3c08df772b4eda74590 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19413)
 
   * baf417339e41d1eb4bad6b351162e0457a9965ad Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19425)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9468:
URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690205920

   
   ## CI report:
   
   * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360)
 
   * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN
   * 9b2af18e26cea1d9ff7239d749a89ed35e9b4de8 UNKNOWN
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9482:
URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690189441

   
   ## CI report:
   
   * 6eca81567a20e35f4f89ac2267fb7d538e09feec Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19381)
 
   * 39166302aadd51524e017f92a883e960e07a37a4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19426)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9468:
URL: https://github.com/apache/hudi/pull/9468#issuecomment-1690189258

   
   ## CI report:
   
   * 599baa2fe937dd35d24fcb2315e99d084cf31c10 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19360)
 
   * 6451d5f7db3326eb691185ec1b4351dc6cbbdd1b UNKNOWN
   
   

[GitHub] [hudi] ad1happy2go commented on issue #9397: [SUPPORT] column_stats index filtering returns empty results

2023-08-23 Thread via GitHub



ad1happy2go commented on issue #9397:
URL: https://github.com/apache/hudi/issues/9397#issuecomment-169010

   @nandubatchu I was not able to reproduce this issue with a sample dataset. Can 
you share your table configs? Below is the code I tried. I confirmed that it 
uses col stats for one of the columns.
   Please share more details about your data to help us triage this better.
   
   ```
   create table issue_col_stats_issue_9397_1 (
   id string,
   name string,
   col1 int,
   col2 int,
   col3 int,
   updated_at string
   ) using hudi
   tblproperties (
   type = 'cow',
   primaryKey = 'id',
   preCombineField = 'updated_at',
   hoodie.metadata.index.column.stats.enable = 'true',
   hoodie.metadata.index.column.stats.column.list = 'col1'
   ) location 'file:///tmp/issue_col_stats_issue_9397_1';
   
   INSERT INTO issue_col_stats_issue_9397_1 VALUES
   ('1', 'c',1,1,1,'2023-01-01 00:00:00'),
   ('2', 'c',2,2,2,'2023-01-01 00:00:00'),
   ('3', 'c',3,3,3,'2023-01-01 00:00:00');
   
   
   -- Verify if col stats present at /tmp/issue_col_stats_issue_9397_1/.hoodie/metadata/
   select * from issue_col_stats_issue_9397_1 where col1 = 1 and col2  =  1 and 
col3  =  1;
   -- Correct Results
   select * from issue_col_stats_issue_9397_1 where col1 > 1 and col2 > 1 and 
col3 > 1;
   -- Correct Results
   
   ```



[GitHub] [hudi] hudi-bot commented on pull request #9510: [HUDI-6739] Skip checking Active Timeline to improve performance for multiple streaming writers

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9510:
URL: https://github.com/apache/hudi/pull/9510#issuecomment-1690189633

   
   ## CI report:
   
   * caf735fd3d8cba7d7e05ee7c8a08fcc6d810fb3f UNKNOWN
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1690189040

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * 37e7683cbab44901aaeab3c08df772b4eda74590 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19413)
 
   * baf417339e41d1eb4bad6b351162e0457a9965ad UNKNOWN
   
   

[GitHub] [hudi] nsivabalan commented on a diff in pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario

2023-08-23 Thread via GitHub



nsivabalan commented on code in PR #9468:
URL: https://github.com/apache/hudi/pull/9468#discussion_r1303181318


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java:
##
@@ -278,4 +280,14 @@ public HoodieCleanMetadata execute() {
     // This requires the CleanActionExecutor to be refactored as BaseCommitActionExecutor
     return cleanMetadataList.size() > 0 ? cleanMetadataList.get(cleanMetadataList.size() - 1) : null;
   }
+
+  private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, HoodieIOException e) {
+    table.getMetaClient().reloadActiveTimeline();
+    if (table.getCleanTimeline().filterCompletedInstants().containsInstant(hoodieInstant.getTimestamp())) {
+      LOG.warn("Clean operation was completed by another writer for instant: " + hoodieInstant);
+    } else {
+      LOG.warn("Failed to perform previous clean operation, instant: " + hoodieInstant, e);

Review Comment:
   LOG.error()
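
One reading of the comment above is that the `else` branch records a real failure and so should log at error level, while the "completed by another writer" branch stays a warning. The Python sketch below only illustrates that split; it is not the Java change itself, and `completed_instants` is a hypothetical stand-in for the reloaded clean timeline.

```python
import logging

LOG = logging.getLogger("clean_action_sketch")


def check_if_other_writer_committed(completed_instants, instant, error):
    """Return True if another writer already completed the clean `instant`.

    `completed_instants` is a hypothetical set of completed clean instants,
    standing in for the reloaded timeline in the Java code under review.
    """
    if instant in completed_instants:
        # Benign case: another writer finished the clean, so a warning is enough.
        LOG.warning("Clean operation was completed by another writer for instant: %s", instant)
        return True
    # Genuine failure: log at error level and attach the exception.
    LOG.error("Failed to perform previous clean operation, instant: %s", instant, exc_info=error)
    return False
```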




[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9509:
URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690114750

   
   ## CI report:
   
   * 96700cfb604e637219f231d28a84dfcf033112b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19424)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9482:
URL: https://github.com/apache/hudi/pull/9482#issuecomment-1690114615

   
   ## CI report:
   
   * 6eca81567a20e35f4f89ac2267fb7d538e09feec Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19381)
 
   * 39166302aadd51524e017f92a883e960e07a37a4 UNKNOWN
   
   

[jira] [Updated] (HUDI-6739) Avoid checking timeline for successful commits for spark structured streaming when offset is 0

2023-08-23 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6739:
-
Labels: pull-request-available  (was: )

> Avoid checking timeline for successful commits for spark structured streaming 
> when offset is 0
> --
>
> Key: HUDI-6739
> URL: https://issues.apache.org/jira/browse/HUDI-6739
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Aditya Goenka
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> For the Spark Structured Streaming writer, we check the active timeline even 
> for a new writer to identify the last successful commit and ensure 
> idempotency. There is no need to check the checkpoint when the offset is 0, 
> because that means the writer is new.




[GitHub] [hudi] ad1happy2go opened a new pull request, #9510: [HUDI-6739] Skip checking Active Timeline to improve performance

2023-08-23 Thread via GitHub



ad1happy2go opened a new pull request, #9510:
URL: https://github.com/apache/hudi/pull/9510

   ### Change Logs
   
   Improved the checkpoint management logic for multiple streaming writers so 
that it does not check the active timeline when the offset is 0, since that 
means it is a new writer.
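
The short-circuit described above can be sketched roughly as follows. This is an illustration in Python, not the PR's actual code: `can_skip_batch` and the `last_committed_batch_id` callback are hypothetical names, with the callback standing in for the active-timeline lookup.

```python
from typing import Callable, Optional


def can_skip_batch(batch_id: int,
                   last_committed_batch_id: Callable[[], Optional[int]]) -> bool:
    """Decide whether a streaming batch was already committed and can be skipped.

    `last_committed_batch_id` is a hypothetical callback standing in for the
    (comparatively expensive) active-timeline lookup; it is not a Hudi API.
    """
    if batch_id == 0:
        # Offset 0 means a new writer: nothing can have been committed yet,
        # so return without touching the timeline at all.
        return False
    last = last_committed_batch_id()
    return last is not None and batch_id <= last
```

The lookup is passed as a callback so that the offset-0 path never triggers it.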
   
   ### Impact
   none
   
   ### Risk level (write none, low medium or high below)
   low
   
   ### Documentation Update
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[GitHub] [hudi] hudi-bot commented on pull request #9509: [MINOR] Add operationType in alter table commit metadata

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9509:
URL: https://github.com/apache/hudi/pull/9509#issuecomment-1690099343

   
   ## CI report:
   
   * 96700cfb604e637219f231d28a84dfcf033112b9 UNKNOWN
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9485:
URL: https://github.com/apache/hudi/pull/9485#issuecomment-1690084274

   
   ## CI report:
   
   * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19422)
 
   
   

[GitHub] [hudi] Zouxxyy opened a new pull request, #9509: [MINOR] Add operationType in alter table commit metadata

2023-08-23 Thread via GitHub



Zouxxyy opened a new pull request, #9509:
URL: https://github.com/apache/hudi/pull/9509

   ### Change Logs
   
   Add `operationType` in alter table commit metadata
   
   before,  xxx.commit
   ```json
   {
 "partitionToWriteStats" : { },
 "compacted" : false,
 "extraMetadata" : {
   "schema" : "xxx",
   "latest_schema" : "xxx"
 },
 "operationType" : null
   }
   ```
   
   after, xxx.commit
   
   ```json
   {
 "partitionToWriteStats" : { },
 "compacted" : false,
 "extraMetadata" : {
   "schema" : "xxx",
   "latest_schema" : "xxx"
 },
 "operationType" : "ALTER_SCHEMA"
   }
   ```
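
As a small illustration of why the populated field helps, a reader of the commit file could branch on it directly. The sketch below parses text shaped like the "after" example; the `operation_type` helper is hypothetical and the `xxx` values are placeholders copied from above.

```python
import json

# Sample text mirroring the "after" example; the "xxx" values are placeholders.
commit_json = """
{
  "partitionToWriteStats": {},
  "compacted": false,
  "extraMetadata": {"schema": "xxx", "latest_schema": "xxx"},
  "operationType": "ALTER_SCHEMA"
}
"""


def operation_type(commit_text):
    """Return the operationType recorded in commit metadata, or None if unset."""
    return json.loads(commit_text).get("operationType")


print(operation_type(commit_json))
```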
   
   ### Impact
   
   Above
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9501:
URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689979230

   
   ## CI report:
   
   * 2db1e3e6179a5b45172ea9abae0c869375fffc1b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19421)
 
   
   

[hudi] branch master updated: [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)

2023-08-23 Thread sivabalan

This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 87a187a91d4 [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)
87a187a91d4 is described below

commit 87a187a91d472f16d569b6388098d27a3aede760
Author: Lokesh Jain 
AuthorDate: Wed Aug 23 18:39:08 2023 +0530

[HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)

- Since the log block version (due to delete block change) has been 
upgraded in 0.14.0, the delete blocks can not be read in 0.13.0 or earlier.
- Similarly the addition of record level index field in metadata table 
leads to column drop error on downgrade. The Jira aims to fix the downgrade 
handler to trigger compaction and delete metadata table if user wishes to 
downgrade from version six (0.14.0) to version 5 (0.13.0).
---
 .../table/upgrade/SixToFiveDowngradeHandler.java   |  53 ++--
 .../table/upgrade/SupportsUpgradeDowngrade.java|   3 +
 .../table/upgrade/FlinkUpgradeDowngradeHelper.java |   7 +
 .../table/upgrade/JavaUpgradeDowngradeHelper.java  |   7 +
 .../table/upgrade/SparkUpgradeDowngradeHelper.java |   7 +
 .../hudi/table/upgrade/TestUpgradeDowngrade.java   |  10 +-
 .../functional/TestSixToFiveDowngradeHandler.scala | 142 +
 7 files changed, 211 insertions(+), 18 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
index 228c0f710a8..4793f368f81 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
@@ -18,19 +18,26 @@
 
 package org.apache.hudi.table.upgrade;
 
+import org.apache.hudi.client.BaseHoodieWriteClient;
 import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.common.table.HoodieTableVersion;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
 import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
+import 
org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy;
 
 import org.apache.hadoop.fs.Path;
 
@@ -39,12 +46,15 @@ import java.util.Map;
 
 import static 
org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS;
 import static 
org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT;
-import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTablePartition;
 
 /**
  * Downgrade handle to assist in downgrading hoodie table from version 6 to 5.
  * To ensure compatibility, we need recreate the compaction requested file to
  * .aux folder.
+ * Since version 6 includes a new schema field for metadata table(MDT),
+ * the MDT needs to be deleted during downgrade to avoid column drop error.
+ * Also log block version was upgraded in version 6, therefore full compaction 
needs
+ * to be completed during downgrade to avoid both read and future compaction 
failures.
  */
 public class SixToFiveDowngradeHandler implements DowngradeHandler {
 
@@ -52,11 +62,16 @@ public class SixToFiveDowngradeHandler implements 
DowngradeHandler {
   public Map downgrade(HoodieWriteConfig config, 
HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade 
upgradeDowngradeHelper) {
 final HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
 
-removeRecordIndexIfNeeded(table, context);
+// Since version 6 includes a new schema field for metadata table(MDT), 
the MDT needs to be deleted during downgrade to avoid column drop error.
+HoodieTableMetadataUtil.deleteMetadataTable(config.getBasePath(), context);
+// The log block version has been upgraded in version six so compaction is 
required for downgrade.
+runCompaction(table,

[GitHub] [hudi] nsivabalan merged pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0

2023-08-23 Thread via GitHub



nsivabalan merged PR #9467:
URL: https://github.com/apache/hudi/pull/9467



[GitHub] [hudi] Hans-Raintree opened a new issue, #9508: [SUPPORT] Reading CDC logs can't extract data type

2023-08-23 Thread via GitHub



Hans-Raintree opened a new issue, #9508:
URL: https://github.com/apache/hudi/issues/9508

   Describe the problem you faced
   
   When reading CDC logs in PySpark, the before/after columns are returned in 
JSON format. It is difficult to convert them back to the correct data types; 
e.g., a timestamp is returned as a number, and to convert it back to a 
timestamp you have to do something like this:
   
   df = df.withColumn(col_name, to_timestamp(from_unixtime(df[col_name]/100)))
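
The same conversion can be sketched in plain Python, outside Spark. This assumes the number is microseconds since the Unix epoch, which is a guess and is not confirmed in this thread; `micros_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the CDC JSON holds timestamps as microseconds since the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros: int) -> datetime:
    """Convert integer microseconds since the epoch to an aware UTC datetime."""
    # timedelta arithmetic keeps exact microsecond precision (no float division).
    return EPOCH + timedelta(microseconds=micros)


print(micros_to_datetime(1_692_748_800_000_000).isoformat())
```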
   
   A decimal is stored as something like [0, 0, 0, 0, 0, 0, 26, -27, -78], 
which is even more difficult to parse back into a number.
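
The array above looks like the unscaled decimal value written as signed bytes in big-endian two's-complement order, which is how Avro's decimal logical type encodes values. Under that assumption, a rough decoding sketch follows; `decode_decimal` is a hypothetical helper, and the scale of 2 is arbitrary because the real scale has to come from the column's schema.

```python
from decimal import Decimal


def decode_decimal(signed_bytes, scale):
    """Decode a big-endian two's-complement byte list into a Decimal.

    Assumes the list holds the unscaled value as Java-style signed bytes
    (-128..127); `scale` is the number of digits after the decimal point.
    """
    # Map signed bytes to 0..255 so they can be packed into a bytes object.
    raw = bytes(b & 0xFF for b in signed_bytes)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # scaleb(-scale) shifts the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)


# The scale of 2 here is illustrative only.
print(decode_decimal([0, 0, 0, 0, 0, 0, 26, -27, -78], scale=2))
```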
   
   I see that the data types are stored in the .cdc file, but I don't see how I 
can access them.
   
   To Reproduce
   
   Steps to reproduce the behavior:
   
   Write a table with
   'hoodie.table.cdc.enabled': 'true',
   'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after'
   Read the cdc logs with:
   'hoodie.datasource.query.type': 'incremental',
   'hoodie.datasource.read.begin.instanttime': begin_time,
   'hoodie.datasource.query.incremental.format': 'cdc'
   
   Expected behavior
   
   A way to extract columns from the before/after columns with their original data types.
   
   Environment Description
   
   Hudi version : 0.13.1
   
   Spark version : 3.3.2 / 3.4.0
   
   Hive version : 3.1.3
   
   Hadoop version : 2.7 / 3.3.3
   
   Storage (HDFS/S3/GCS..) : S3 / Local
   
   Running on Docker? (yes/no) : no
   
   Additional context
   
   Happens both in AWS EMR 6.12.0 and when running locally.



[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 support matrix doc

2023-08-23 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6740:
-
Labels: pull-request-available  (was: )

> Add 0.13.x to Spark 3 support matrix doc
> 
>
> Key: HUDI-6740
> URL: https://issues.apache.org/jira/browse/HUDI-6740
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Akira Ajisaka
>Priority: Minor
>  Labels: pull-request-available
>
> Hudi 0.13.x is missing in the Spark 3 support matrix doc 
> [https://hudi.apache.org/docs/quick-start-guide/]




[GitHub] [hudi] aajisaka opened a new pull request, #9507: [HUDI-6740] Add 0.13.x to Spark 3 support matrix doc

2023-08-23 Thread via GitHub



aajisaka opened a new pull request, #9507:
URL: https://github.com/apache/hudi/pull/9507

   ### Change Logs
   
   Add Hudi 0.13.x to the Spark 3 support matrix doc 
https://hudi.apache.org/docs/quick-start-guide/
   
   ### Impact
   
   Doc only change
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   This is a documentation-only update
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 support matrix doc

2023-08-23 Thread Akira Ajisaka (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HUDI-6740:

Description: Hudi 0.13.x is missing in the Spark 3 support matrix doc 
[https://hudi.apache.org/docs/quick-start-guide/]  (was: Hudi 0.13.x is missing 
in the Spark 3 matrix support doc 
[https://hudi.apache.org/docs/quick-start-guide/])
Summary: Add 0.13.x to Spark 3 support matrix doc  (was: Add 0.13.x to 
Spark 3 matrix support doc)

> Add 0.13.x to Spark 3 support matrix doc
> 
>
> Key: HUDI-6740
> URL: https://issues.apache.org/jira/browse/HUDI-6740
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Akira Ajisaka
>Priority: Minor
>
> Hudi 0.13.x is missing in the Spark 3 support matrix doc 
> [https://hudi.apache.org/docs/quick-start-guide/]




[jira] [Created] (HUDI-6740) Add 0.13.x to Spark 3 matrix support doc

2023-08-23 Thread Akira Ajisaka (Jira)

Akira Ajisaka created HUDI-6740:
---

 Summary: Add 0.13.x to Spark 3 matrix support doc
 Key: HUDI-6740
 URL: https://issues.apache.org/jira/browse/HUDI-6740
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Akira Ajisaka


 

Hudi 0.13.x is missing in the Spark 3 matrix support doc 
[https://hudi.apache.org/docs/quick-start-guide/]




[jira] [Updated] (HUDI-6740) Add 0.13.x to Spark 3 matrix support doc

2023-08-23 Thread Akira Ajisaka (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HUDI-6740:

Description: Hudi 0.13.x is missing in the Spark 3 matrix support doc 
[https://hudi.apache.org/docs/quick-start-guide/]  (was:  

Hudi 0.13.x is missing in the Spark 3 matrix support doc 
[https://hudi.apache.org/docs/quick-start-guide/])

> Add 0.13.x to Spark 3 matrix support doc
> 
>
> Key: HUDI-6740
> URL: https://issues.apache.org/jira/browse/HUDI-6740
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Akira Ajisaka
>Priority: Minor
>
> Hudi 0.13.x is missing in the Spark 3 matrix support doc 
> [https://hudi.apache.org/docs/quick-start-guide/]




[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9485:
URL: https://github.com/apache/hudi/pull/9485#issuecomment-1689818375

   
   ## CI report:
   
   * bcbf6a91785f83ca29e61a923b51d6b429f22f01 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19406)
 
   * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19422)
 
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9485:
URL: https://github.com/apache/hudi/pull/9485#issuecomment-1689807493

   
   ## CI report:
   
   * bcbf6a91785f83ca29e61a923b51d6b429f22f01 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19406)
 
   * 48b4ac4196cd2546bca4a5c6f2c81a353c80cd8d UNKNOWN
   
   

[jira] [Created] (HUDI-6739) Avoid checking timeline for successful commits for spark structured streaming when offset is 0

2023-08-23 Thread Aditya Goenka (Jira)

Aditya Goenka created HUDI-6739:
---

 Summary: Avoid checking timeline for successful commits for spark 
structured streaming when offset is 0
 Key: HUDI-6739
 URL: https://issues.apache.org/jira/browse/HUDI-6739
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: Aditya Goenka
 Fix For: 1.0.0


For the Spark Structured Streaming writer, we check the active timeline even 
for a new writer to identify the last successful commit and ensure idempotency. 

There is no need to check the checkpoint when the offset is 0, because that 
means the writer is new.




[GitHub] [hudi] JoshuaZhuCN commented on issue #9418: [SUPPORT] Hudi table does not support Spark SQL's cache table syntax

2023-08-23 Thread via GitHub



JoshuaZhuCN commented on issue #9418:
URL: https://github.com/apache/hudi/issues/9418#issuecomment-1689701271

   @ad1happy2go It's really strange. I tried it in my testing environment and 
there was no problem. The same code has problems in the production environment, 
so I'll investigate the differences between my environments first.



[GitHub] [hudi] lokeshj1703 commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0

2023-08-23 Thread via GitHub



lokeshj1703 commented on PR #9467:
URL: https://github.com/apache/hudi/pull/9467#issuecomment-1689684602

   https://github.com/apache/hudi/assets/9255455/56fff9c4-70d3-4ce1-941d-e25a1e7539c2
   



[GitHub] [hudi] lokeshj1703 commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0

2023-08-23 Thread via GitHub



lokeshj1703 commented on PR #9467:
URL: https://github.com/apache/hudi/pull/9467#issuecomment-1689683472

   ![Uploading Screenshot 2023-08-23 at 3.40.17 PM.png…]()
   



[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9501:
URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689654573

   
   ## CI report:
   
   * 37f1a8c2a46dcd7f7f549e2deffaaa85b65f58e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19407)
   * 2db1e3e6179a5b45172ea9abae0c869375fffc1b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19421)
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9501:
URL: https://github.com/apache/hudi/pull/9501#issuecomment-1689641469

   
   ## CI report:
   
   * 37f1a8c2a46dcd7f7f549e2deffaaa85b65f58e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19407)
   * 2db1e3e6179a5b45172ea9abae0c869375fffc1b UNKNOWN
   
   

[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1689626305

   
   ## CI report:
   
   * d403d1696e76ee6daf6ad6ddc62e9a218a5accab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19417)
   
   

[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



harsh1231 commented on code in PR #9501:
URL: https://github.com/apache/hudi/pull/9501#discussion_r1302716600


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java:
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
   Moved to the sources package.
   It's only for the snapshot load query of incremental sources, so the name is 
kept as it is.
   Added a unit test with a test implementation, concrete implementation at 
this point.




[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



harsh1231 commented on code in PR #9501:
URL: https://github.com/apache/hudi/pull/9501#discussion_r1302716600


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java:
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
   Moved to the sources package.
   The name should indicate that it's only for the snapshot load query of 
incremental sources, so it is kept as it is.
   Added a unit test with a test implementation, concrete implementation at 
this point.




[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi

2023-08-23 Thread Lin Liu (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757870#comment-17757870
 ] 

Lin Liu commented on HUDI-4756:
---

Realized that there are too many unit tests that need debugging, and the 
effort required for this issue is uncertain. Therefore, I will work on this 
issue in the background and prioritize other tasks for now.

> Clean up usages of "assume.date.partition" config within hudi
> -
>
> Key: HUDI-4756
> URL: https://issues.apache.org/jira/browse/HUDI-4756
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: sivabalan narayanan
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> looks like "assume.date.partition" is not used anywhere within hudi. lets 
> clean up the usages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys

2023-08-23 Thread Lin Liu (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu closed HUDI-6701.
-
Resolution: Information Provided

> Explore use of UUID-6/7 as a replacement for current auto generated keys
> 
>
> Key: HUDI-6701
> URL: https://issues.apache.org/jira/browse/HUDI-6701
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Today, we auto-generate string keys of the form produced by 
> HoodieRecord#generateSequenceId, which are highly compressible, especially 
> compared to uuidv1, when stored as a string column inside a Parquet file.
> {code:java}
>   public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
>     return instantTime + "_" + partitionId + "_" + recordIndex;
>   }
> {code}
> As a part of this task, we'd love to understand:
> - Can uuid6 or uuid7 provide a similar compressed storage footprint when 
> written as a column in a Parquet file?
> - Can the current format be represented as a 160-bit number, i.e. 2 longs and 
> 1 int, in storage? Would that save us further storage costs?
> (An orthogonal consideration is the memory needed to hold the key string, 
> which can be higher than 160 bits. We can discuss this later, once we 
> understand the storage footprint.)
>  
> Resources:
> * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ 
> * https://github.com/uuid6/uuid6-ietf-draft
> * https://github.com/uuid6/prototypes 
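
To make the 160-bit question above concrete, here is a minimal sketch of the current string format next to a packed binary form (2 longs and 1 int, 20 bytes). `SequenceIdSketch`, `pack`, and `unpack` are illustrative names only, and the sketch assumes the instant time is a purely numeric timestamp without leading zeros that fits in a long; it says nothing about how well either form compresses in Parquet.

```java
import java.nio.ByteBuffer;

// Sketch only: SequenceIdSketch and its methods are illustrative names, not Hudi APIs.
final class SequenceIdSketch {
    private SequenceIdSketch() {}

    // Mirrors the string format quoted in the issue: instantTime_partitionId_recordIndex.
    static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    // Packs the same three fields into 20 bytes (160 bits): long + int + long.
    // Assumption: instantTime is numeric, has no leading zeros, and fits in a long.
    static byte[] pack(String instantTime, int partitionId, long recordIndex) {
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES + Long.BYTES)
            .putLong(Long.parseLong(instantTime))
            .putInt(partitionId)
            .putLong(recordIndex)
            .array();
    }

    // Rebuilds the string form from the packed bytes.
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long instant = buf.getLong();
        int partitionId = buf.getInt();
        long recordIndex = buf.getLong();
        return instant + "_" + partitionId + "_" + recordIndex;
    }
}
```

The packed form has a fixed width regardless of how many digits the fields have, whereas the string form grows with each field's decimal length.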



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys

2023-08-23 Thread Lin Liu (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757867#comment-17757867
 ] 

Lin Liu commented on HUDI-6701:
---

Based on our discussion and experiments, we conclude that we will keep the 
current row key format since it reaches the sweet spot between time and storage 
tradeoffs. This task is closed for now.

> Explore use of UUID-6/7 as a replacement for current auto generated keys
> 
>
> Key: HUDI-6701
> URL: https://issues.apache.org/jira/browse/HUDI-6701
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Today, we auto-generate string keys of the form produced by 
> HoodieRecord#generateSequenceId, which are highly compressible, especially 
> compared to uuidv1, when stored as a string column inside a Parquet file.
> {code:java}
>   public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
>     return instantTime + "_" + partitionId + "_" + recordIndex;
>   }
> {code}
> As a part of this task, we'd love to understand:
> - Can uuid6 or uuid7 provide a similar compressed storage footprint when 
> written as a column in a Parquet file?
> - Can the current format be represented as a 160-bit number, i.e. 2 longs and 
> 1 int, in storage? Would that save us further storage costs?
> (An orthogonal consideration is the memory needed to hold the key string, 
> which can be higher than 160 bits. We can discuss this later, once we 
> understand the storage footprint.)
>  
> Resources:
> * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ 
> * https://github.com/uuid6/uuid6-ietf-draft
> * https://github.com/uuid6/prototypes 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[hudi] branch master updated (c45c3782d33 -> 518d8728b1f)

2023-08-23 Thread codope

This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from c45c3782d33 [HUDI-6549] Add support for comma separated path format 
for spark.read.load (#9503)
 add 518d8728b1f [HUDI-4115] Adding support for schema while loading spark 
dataset in S3/GCS source (#9502)

No new revisions were added by this update.

Summary of changes:
 .../sources/GcsEventsHoodieIncrSource.java |  5 +++-
 .../sources/S3EventsHoodieIncrSource.java  |  5 +++-
 .../sources/helpers/CloudDataFetcher.java  |  6 ++--
 .../helpers/CloudObjectsSelectorCommon.java| 17 ++-
 .../sources/TestGcsEventsHoodieIncrSource.java | 34 +++---
 .../sources/TestS3EventsHoodieIncrSource.java  | 28 +-
 .../helpers/TestCloudObjectsSelectorCommon.java| 17 +++
 .../test/resources/schema/sample_data_schema.avsc  | 16 ++
 .../src/test/resources/schema/sample_gcs_data.avsc | 20 -
 9 files changed, 111 insertions(+), 37 deletions(-)
 copy 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
 => hudi-utilities/src/test/resources/schema/sample_data_schema.avsc (81%)
 copy hudi-common/src/test/resources/simple-test.avsc => 
hudi-utilities/src/test/resources/schema/sample_gcs_data.avsc (78%)

[GitHub] [hudi] codope merged pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source

2023-08-23 Thread via GitHub



codope merged PR #9502:
URL: https://github.com/apache/hudi/pull/9502



[GitHub] [hudi] hudi-bot commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9502:
URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689450574

   
   ## CI report:
   
   * 63342240f5d36496ac7ac12565878ae661600ec2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19408)
   * 0833c22802627298ba32024eed36d4114204a389 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19419)
   
   

[GitHub] [hudi] codope commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source

2023-08-23 Thread via GitHub



codope commented on PR #9502:
URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689449336

   Azure CI passed 
   https://github.com/apache/hudi/assets/16440354/fbf410cd-f650-430f-94f6-da110d578ceb
   GH actions were already green last time and only license header was added in 
the subsequent commit. Landing the PR.



[GitHub] [hudi] hudi-bot commented on pull request #9502: [HUDI-4115] Adding support for schema while loading spark dataset in s3/gcs source

2023-08-23 Thread via GitHub



hudi-bot commented on PR #9502:
URL: https://github.com/apache/hudi/pull/9502#issuecomment-1689439250

   
   ## CI report:
   
   * 63342240f5d36496ac7ac12565878ae661600ec2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19408)
   * 0833c22802627298ba32024eed36d4114204a389 UNKNOWN
   
   

[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



harsh1231 commented on code in PR #9501:
URL: https://github.com/apache/hudi/pull/9501#discussion_r1302582210


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
##
@@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession,
       SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment:
   Yeah, this will fall back to the regular pipeline when the value is null. 
If the class does not exist, I think we should keep the current behavior (it 
could be a jar issue) and fail the pipeline when the class is not found.
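
The behavior discussed here (use the regular path when nothing is configured, but fail when a configured class cannot be loaded) can be sketched with plain JDK reflection. This is a simplified stand-in, not the Hudi implementation: `OptionalPluginLoader`, `SPLITTER_CLASS_KEY`, and the no-argument constructor are assumptions made for illustration, whereas the diff above uses Hudi's own `Option`, `ReflectionUtils.loadClass`, and a `TypedProperties` constructor argument.

```java
import java.util.Optional;
import java.util.Properties;

// Sketch only: the property key and class name below are hypothetical, not Hudi's.
final class OptionalPluginLoader {
    static final String SPLITTER_CLASS_KEY = "example.splitter.class.name";

    private OptionalPluginLoader() {}

    static Optional<Object> load(Properties props) {
        // No class configured: return empty so the caller takes the regular path.
        return Optional.ofNullable(props.getProperty(SPLITTER_CLASS_KEY))
            .map(className -> {
                try {
                    // Assumes the plugin has a public no-argument constructor.
                    return Class.forName(className).getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    // Class configured but not loadable (for example a missing jar): fail loudly.
                    throw new IllegalStateException("Could not load " + className, e);
                }
            });
    }
}
```

Failing on a configured-but-missing class surfaces packaging mistakes early instead of silently running without the plugin.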




[GitHub] [hudi] harsh1231 commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



harsh1231 commented on code in PR #9501:
URL: https://github.com/apache/hudi/pull/9501#discussion_r1302582210


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
##
@@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession,
       SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment:
   Yeah, this will fall back to the regular pipeline when the value is null. 
If the class does not exist, I think we should keep the current behavior (it 
could be a jar issue).




[GitHub] [hudi] codope commented on a diff in pull request #9501: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources.

2023-08-23 Thread via GitHub



codope commented on code in PR #9501:
URL: https://github.com/apache/hudi/pull/9501#discussion_r1302576522


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SnapshotLoadQuerySplitter.java:
##
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+/**
+ * Abstract splitter responsible for managing the snapshot load query operations.
+ */
+public abstract class SnapshotLoadQuerySplitter {

Review Comment:
   1. This is not a helper. Move the abstraction out of the `helpers` package.
   2. How about naming the base class simply `IncrementalLoadSplitter`, as 
it is used only for incremental sources?
   3. Why are we not providing any concrete out-of-the-box implementation?
   4. Can we add a unit test for the splitter's `getNextCheckpoint`?



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
##
@@ -128,6 +133,10 @@ public static class Config {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession,
       SchemaProvider schemaProvider) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    this.snapshotLoadQuerySplitter = Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, null))
+        .map(className -> (SnapshotLoadQuerySplitter) ReflectionUtils.loadClass(className,
+            new Class[]{TypedProperties.class}, props));

Review Comment:
   If it's null or does not exist, should we fall back to the regular path 
instead of breaking the pipeline?




[jira] [Closed] (HUDI-6549) Add support for comma separated read path format in CloudObjectsSelectorCommon

2023-08-23 Thread Sagar Sumit (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6549.
-
Fix Version/s: 0.14.0
   Resolution: Fixed

> Add support for comma separated read path format in  
> CloudObjectsSelectorCommon
> ---
>
> Key: HUDI-6549
> URL: https://issues.apache.org/jira/browse/HUDI-6549
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> For file formats like XML, a comma-separated path format is supported for 
> loading data from more than one file. 
> An array of strings is not supported in spark.read.load() for such formats. 
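
As a small illustration of the two path representations mentioned in the issue, the sketch below converts between a list of paths and a single comma-separated string using only the JDK. `PathFormatSketch` and its method names are hypothetical, the example paths are made up, and the sketch assumes individual paths contain no commas; how the resulting string is handed to a reader is left out on purpose.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: PathFormatSketch is an illustrative name, not part of Hudi or Spark.
final class PathFormatSketch {
    private PathFormatSketch() {}

    // Joins several file paths into one comma-separated string.
    static String toCommaSeparated(List<String> paths) {
        return String.join(",", paths);
    }

    // Splits a comma-separated string back into individual paths, dropping blanks.
    static String[] toPathArray(String commaSeparated) {
        return Arrays.stream(commaSeparated.split(","))
            .map(String::trim)
            .filter(p -> !p.isEmpty())
            .toArray(String[]::new);
    }
}
```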



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [hudi] codope merged pull request #9503: [HUDI-6549] Adding support for comma separated path format for spark.read.load

2023-08-23 Thread via GitHub



codope merged PR #9503:
URL: https://github.com/apache/hudi/pull/9503



[hudi] branch master updated (cbd6f91991f -> c45c3782d33)

2023-08-23 Thread codope

This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from cbd6f91991f [HUDI-6692] Don't default to bulk insert on nonpkless 
table if recordkey is omitted (#9444)
 add c45c3782d33 [HUDI-6549] Add support for comma separated path format 
for spark.read.load (#9503)

No new revisions were added by this update.

Summary of changes:
 .../sources/helpers/CloudObjectsSelectorCommon.java  | 11 ++-
 .../utilities/sources/helpers/CloudStoreIngestionConfig.java | 12 
 .../sources/helpers/TestCloudObjectsSelectorCommon.java  |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)
