[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300994921


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";

Review Comment:
   This is because the inflight state does not appear to carry any file ID information. However, if two plans are being executed concurrently and neither has been committed yet, could we prevent this situation by performing additional validation against the inflight state?
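
A minimal sketch of the idea being discussed, assuming the write side serializes the planned file ids with JsonUtils into the commit's extra metadata (the read side in this PR expects exactly that key); the class and method names below are illustrative, not the PR's code:

import com.fasterxml.jackson.core.JsonProcessingException;
import org.apache.hudi.common.util.JsonUtils;
import org.apache.hudi.table.action.cluster.ReplaceCommitValidateUtil;

import java.util.Map;

public class ReplaceCommitFileIdMetadataSketch {
  // Hypothetical helper, not part of the PR: stash the file ids a replace/clustering plan
  // intends to touch into the extra metadata of the requested/inflight replacecommit, so
  // ReplaceCommitValidateUtil#validateReplaceCommit can also check pending plans.
  public static void recordPlannedFileIds(Map<String, String> extraMetadata,
                                          String[] plannedFileIds) throws JsonProcessingException {
    extraMetadata.put(ReplaceCommitValidateUtil.REPLACE_COMMIT_FILE_IDS,
        JsonUtils.getObjectMapper().writeValueAsString(plannedFileIds));
  }
}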
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300983456


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {
+          return JsonUtils.getObjectMapper().readValue(
+              replaceCommitMetadata.getExtraMetadata().getOrDefault(REPLACE_COMMIT_FILE_IDS, "non-existent key"), String[].class);
+        } else {
+          return replaceCommitMetadata.getPartitionToReplaceFileIds().values().stream()
+              .flatMap(List::stream)
+              .toArray(String[]::new);
+        }
+      } catch (IOException e) {
+        // If the key does not exist or there is a JSON parsing error, LOG reports an error and ignores it.
+        LOG.error("Error when reading replace commit meta", e);
+        return null;
+      }
+    }).filter(Objects::nonNull)
+        .forEach(fileIdArray -> {
+          Arrays.stream(fileIdArray)
+              .filter(fileId -> !replaceFileids.add(fileId))
+              .findFirst()
+              .ifPresent(s -> {
+                throw new HoodieException("Replace commit involves duplicate file id!");
+              });
+        });
+  }
+}

Review Comment:
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import 

[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300983273


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {
+          return JsonUtils.getObjectMapper().readValue(
+              replaceCommitMetadata.getExtraMetadata().getOrDefault(REPLACE_COMMIT_FILE_IDS, "non-existent key"), String[].class);
+        } else {
+          return replaceCommitMetadata.getPartitionToReplaceFileIds().values().stream()
+              .flatMap(List::stream)
+              .toArray(String[]::new);
+        }
+      } catch (IOException e) {
+        // If the key does not exist or there is a JSON parsing error, LOG reports an error and ignores it.
+        LOG.error("Error when reading replace commit meta", e);
+        return null;
+      }
+    }).filter(Objects::nonNull)
+        .forEach(fileIdArray -> {
+          Arrays.stream(fileIdArray)
+              .filter(fileId -> !replaceFileids.add(fileId))
+              .findFirst()
+              .ifPresent(s -> {
+                throw new HoodieException("Replace commit involves duplicate file id!");

Review Comment:
   I will revise this description



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300982795


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {
+          return JsonUtils.getObjectMapper().readValue(
+              replaceCommitMetadata.getExtraMetadata().getOrDefault(REPLACE_COMMIT_FILE_IDS, "non-existent key"), String[].class);
+        } else {
+          return replaceCommitMetadata.getPartitionToReplaceFileIds().values().stream()
+              .flatMap(List::stream)
+              .toArray(String[]::new);
+        }
+      } catch (IOException e) {
+        // If the key does not exist or there is a JSON parsing error, LOG reports an error and ignores it.
+        LOG.error("Error when reading replace commit meta", e);
+        return null;
+      }
+    }).filter(Objects::nonNull)
+        .forEach(fileIdArray -> {
+          Arrays.stream(fileIdArray)
+              .filter(fileId -> !replaceFileids.add(fileId))
+              .findFirst()
+              .ifPresent(s -> {
+                throw new HoodieException("Replace commit involves duplicate file id!");

Review Comment:
   Reply to “I think we better throw the exception here instead of return null”:
   My reasoning is as follows: there may be cases, such as tests, where an inflight timeline is constructed without going through the normal execution path, so the extra validation metadata added here would be missing. If that happens at runtime, a parsing exception would be raised at this point. I do not want the additional validation introduced here to affect the original logic: without this validation metadata, the replace action would otherwise fail to execute. Therefore I only report the error through logging instead of throwing an exception. Do you think this makes sense?
   
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the 

[GitHub] [hudi] majian1998 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300982366


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {

Review Comment:
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {

Review Comment:
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hehuiyuan commented on pull request #9476: Add detail exception when instant transition state

2023-08-21 Thread via GitHub


hehuiyuan commented on PR #9476:
URL: https://github.com/apache/hudi/pull/9476#issuecomment-1687472158

   Hi @danny0405 , take a look when you have time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


stream2000 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300967476


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {
+          return JsonUtils.getObjectMapper().readValue(
+              replaceCommitMetadata.getExtraMetadata().getOrDefault(REPLACE_COMMIT_FILE_IDS, "non-existent key"), String[].class);

Review Comment:
   Use `get` instead, and throw a readable exception when `REPLACE_COMMIT_FILE_IDS` is null.
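
A minimal sketch of what the suggestion could look like for the pending-instant branch above; this is an assumption about the fix, not the PR's final code:

import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
import org.apache.hudi.common.util.JsonUtils;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.table.action.cluster.ReplaceCommitValidateUtil;

import java.io.IOException;

public class PendingReplaceFileIdsSketch {
  // Hypothetical helper: fail with a readable message when the validation metadata is absent,
  // instead of letting getOrDefault(...) feed a placeholder string into the JSON parser.
  static String[] readPlannedFileIds(HoodieReplaceCommitMetadata metadata, String instantTime) throws IOException {
    String json = metadata.getExtraMetadata().get(ReplaceCommitValidateUtil.REPLACE_COMMIT_FILE_IDS);
    if (json == null) {
      throw new HoodieException("Pending replace commit " + instantTime + " has no "
          + ReplaceCommitValidateUtil.REPLACE_COMMIT_FILE_IDS + " metadata to validate against");
    }
    return JsonUtils.getObjectMapper().readValue(json, String[].class);
  }
}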



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


stream2000 commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300966820


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {

Review Comment:
   Can we remove this check since we have already filtered out non-inflight 
instants? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted

2023-08-21 Thread via GitHub


nsivabalan commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-168746

   @rahil-c : do you happen to know why TestFSUtilsWithRetryWrapperEnable is 
failing with Java 17 specifically? If you check GitHub Actions, the Java 17 module is 
failing, and when I looked at the logs I saw TestFSUtilsWithRetryWrapperEnable 
failing, repeated 204 times. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Riddle4045 commented on issue #9495: [SUPPORT] Writing Hudi tables with Flink fails with HFile exceptions

2023-08-21 Thread via GitHub


Riddle4045 commented on issue #9495:
URL: https://github.com/apache/hudi/issues/9495#issuecomment-168726

   > The missing class is already in the bundle jar right?
   
   @danny0405  you're right, I just verified that. Any clues what might be the issue 
here? The bundle exists on the classpath and the Flink cluster is set up with Hive 
3.1.2 & Hadoop 3.3.2
   
![image](https://github.com/apache/hudi/assets/3648351/f9fbdd6b-bed8-4257-87ea-3629cc0e0d50)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector

2023-08-21 Thread via GitHub


Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1687435446

   > > do you mean the Flink lib directory in the JM/TM
   > 
   > Yes
   
   @danny0405  unfortunately, the Flink runtime is pre-packaged and I don't have 
control over it; it has deps from Hadoop 3.x and Hive 3.1.2. Based on [this 
document](https://hudi.apache.org/docs/faq#what-versions-of-hivesparkhadoop-are-support-by-hudi)
 it looks like Hudi doesn't work with Hadoop 3.x. Is that still true, or is the doc 
stale?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2023-08-21 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1623:
-
Description: 
We suggest a new file naming for the *completed* metadata file:

${start_time}.${action}.${completion_time}

 

We also need a global *Time Generator* that can ensure the monotonical 
increasing generation of the timestamp, for example, maybe hold a mutex lock 
with the last generated timestamp backing up there. Say it may holds a lock 
{*}L1{*}. For each instant time generation, it needs guard from the lock.

 

Before creating the completed file, we also need a lock guard from L1.

 

Things need to note:
1. we only add completion timestamp to the completed metadata file;
2. we only add lock guard to the completed metadata file creation, not the 
whole commiting procedure;
3. for regular instant time generation, we also need a lock (that we should 
ship out by default)

  was:
We suggest a new file naming for the *completed* metadata file:

${start_time}.${action}.${completion_time}

 

We also need a global *Time Generator* that can ensure monotonically 
increasing generation of the timestamp, for example by holding a mutex lock 
with the last generated timestamp backed up there. Say it holds a lock 
{*}L1{*}. Each instant time generation needs to be guarded by that lock.

 

Before creating the completed file, we also need a lock guard from L1.

 

Things to note:
1. we only add the completion timestamp to the completed metadata file;
2. we only add the lock guard to the completed metadata file creation, not the 
whole committing procedure;
3. for regular instant time generation, we also need a lock.


> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Critical
> Fix For: 1.0.0
>
>
> We suggest a new file naming for the *completed* metadata file:
> ${start_time}.${action}.${completion_time}
>  
> We also need a global *Time Generator* that can ensure monotonically 
> increasing generation of the timestamp, for example by holding a mutex lock 
> with the last generated timestamp backed up there. Say it holds a lock 
> {*}L1{*}. Each instant time generation needs to be guarded by that lock.
>  
> Before creating the completed file, we also need a lock guard from L1.
>  
> Things to note:
> 1. we only add the completion timestamp to the completed metadata file;
> 2. we only add the lock guard to the completed metadata file creation, not the 
> whole committing procedure;
> 3. for regular instant time generation, we also need a lock (that we should 
> ship out by default)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
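
A minimal, illustrative sketch of the *Time Generator* idea described in HUDI-1623 above: instant time generation is guarded by a lock so the generated timestamps are monotonically increasing. The class below is an assumption for illustration, not an existing Hudi API; in a multi-writer setup the in-process lock would have to be replaced by a table-level/external lock.

import java.util.concurrent.locks.ReentrantLock;

public class MonotonicTimeGeneratorSketch {
  private final ReentrantLock lock = new ReentrantLock(); // plays the role of the "L1" lock above
  private long lastGeneratedMillis = 0L;

  // Generate an instant time that is strictly greater than any previously generated one.
  public long generateInstantTime() {
    lock.lock();
    try {
      long now = System.currentTimeMillis();
      lastGeneratedMillis = Math.max(now, lastGeneratedMillis + 1);
      return lastGeneratedMillis;
    } finally {
      lock.unlock();
    }
  }
}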


[jira] [Commented] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2023-08-21 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757197#comment-17757197
 ] 

Vinoth Chandar commented on HUDI-1623:
--

Agree on points 2 and 3. I am going to investigate if we can adopt some 
TrueTime APIs (I think we can, will do some reading) 

> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Critical
> Fix For: 1.0.0
>
>
> We suggest a new file naming for the *completed* metadata file:
> ${start_time}.${action}.${completion_time}
>  
> We also need a global *Time Generator* that can ensure monotonically 
> increasing generation of the timestamp, for example by holding a mutex lock 
> with the last generated timestamp backed up there. Say it holds a lock 
> {*}L1{*}. Each instant time generation needs to be guarded by that lock.
>  
> Before creating the completed file, we also need a lock guard from L1.
>  
> Things to note:
> 1. we only add the completion timestamp to the completed metadata file;
> 2. we only add the lock guard to the completed metadata file creation, not the 
> whole committing procedure;
> 3. for regular instant time generation, we also need a lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2023-08-21 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757196#comment-17757196
 ] 

Vinoth Chandar commented on HUDI-1623:
--

> we only add completion timestamp to the completed metadata file;
I need to think through whether this helps ease any pains around syncing from 
DT to MT timelines. 

Instead of 

${start_time}.${action}.${completion_time},
 
should we do?

${start_time}_${completion_time}.${action}

Trying to see what can make it easier to visually look at the timeline. 
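
For comparison, a tiny sketch of how the alternative name proposed above would break apart; the parsing code and the example file name are purely illustrative, nothing like it exists in Hudi today:

public class CompletedInstantNameSketch {
  public static void main(String[] args) {
    // ${start_time}_${completion_time}.${action}, an assumed example file name:
    String fileName = "20230821093000_20230821093105.commit";
    int dot = fileName.lastIndexOf('.');
    String action = fileName.substring(dot + 1);            // "commit"
    String[] times = fileName.substring(0, dot).split("_"); // start time vs. completion time
    System.out.println("start=" + times[0] + ", completion=" + times[1] + ", action=" + action);
  }
}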

> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Critical
> Fix For: 1.0.0
>
>
> We suggest a new file naming for the *completed* metadata file:
> ${start_time}.${action}.${completion_time}
>  
> We also need a global *Time Generator* that can ensure monotonically 
> increasing generation of the timestamp, for example by holding a mutex lock 
> with the last generated timestamp backed up there. Say it holds a lock 
> {*}L1{*}. Each instant time generation needs to be guarded by that lock.
>  
> Before creating the completed file, we also need a lock guard from L1.
>  
> Things to note:
> 1. we only add the completion timestamp to the completed metadata file;
> 2. we only add the lock guard to the completed metadata file creation, not the 
> whole committing procedure;
> 3. for regular instant time generation, we also need a lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1687403746

   
   ## CI report:
   
   * 1e493605d0a26b442efbf1518b063dbb1e616872 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19390)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19389)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19398)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] aajisaka commented on issue #6940: [SUPPORT] Loading preexisting (hudi 0.10) partitioned tables from hive metastore with hudi 0.12

2023-08-21 Thread via GitHub


aajisaka commented on issue #6940:
URL: https://github.com/apache/hudi/issues/6940#issuecomment-1687402488

   > (2) Turn off metadata-table-based file listing in BaseHoodieTableFileIndex 
https://github.com/apache/hudi/pull/7488 cherry-picked for 0.12.2 release
   
   Note that it was reverted by #7526 and the performance is worse in 0.12.3 and 
later. Is there any option to revive this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2023-08-21 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-1623:
-
Description: 
We suggest a new file naming for the *completed* metadata file:

${start_time}.${action}.${completion_time}

 

We also need a global *Time Generator* that can ensure monotonically 
increasing generation of the timestamp, for example by holding a mutex lock 
with the last generated timestamp backed up there. Say it holds a lock 
{*}L1{*}. Each instant time generation needs to be guarded by that lock.

 

Before creating the completed file, we also need a lock guard from L1.

 

Things to note:
1. we only add the completion timestamp to the completed metadata file;
2. we only add the lock guard to the completed metadata file creation, not the 
whole committing procedure;
3. for regular instant time generation, we also need a lock.

  was:
We suggest a new file naming for the *completed* metadata file:

${start_time}.${action}.${completion_time}

 

We also need a global *Time Generator* that can ensure monotonically 
increasing generation of the timestamp, for example by holding a mutex lock 
with the last generated timestamp backed up there. Say it holds a lock 
{*}L1{*}. Each instant time generation needs to be guarded by that lock.

 

Before creating the completed file, we also need a lock guard from L1.

 

Something to note:
1. we only add the completion timestamp to the completed metadata file;
2. we only add the lock guard to the completed metadata file creation, not the 
whole committing procedure;
3. for regular instant time generation, we also need a lock.


> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Critical
> Fix For: 1.0.0
>
>
> We suggest a new file naming for the *completed* metadata file:
> ${start_time}.${action}.${completion_time}
>  
> We also need a global *Time Generator* that can ensure monotonically 
> increasing generation of the timestamp, for example by holding a mutex lock 
> with the last generated timestamp backed up there. Say it holds a lock 
> {*}L1{*}. Each instant time generation needs to be guarded by that lock.
>  
> Before creating the completed file, we also need a lock guard from L1.
>  
> Things to note:
> 1. we only add the completion timestamp to the completed metadata file;
> 2. we only add the lock guard to the completed metadata file creation, not the 
> whole committing procedure;
> 3. for regular instant time generation, we also need a lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9484:
URL: https://github.com/apache/hudi/pull/9484#issuecomment-1687374419

   
   ## CI report:
   
   * 905cc6b4eff305d54e52f4c1ac2d44d449e9afc5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19371)
 
   * a41670e0ef2c2d0433101644431f954bd617280d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19400)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1687374293

   
   ## CI report:
   
   * fadda82b0444d09d8718bc9002fbd1964e18bbf2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19332)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19338)
 
   * 19933a6463f69912c589fad48a15ad3c8ca9050b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19399)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9484:
URL: https://github.com/apache/hudi/pull/9484#issuecomment-1687369516

   
   ## CI report:
   
   * 905cc6b4eff305d54e52f4c1ac2d44d449e9afc5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19371)
 
   * a41670e0ef2c2d0433101644431f954bd617280d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1687369397

   
   ## CI report:
   
   * fadda82b0444d09d8718bc9002fbd1964e18bbf2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19332)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19338)
 
   * 19933a6463f69912c589fad48a15ad3c8ca9050b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6733) Add flink-metrics-dropwizard to flink bundle

2023-08-21 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6733.

Resolution: Fixed

Fixed via master branch: e2d47605738c71c4f2ddad0572a0d4c9fe0d58ad

> Add flink-metrics-dropwizard to flink bundle
> 
>
> Key: HUDI-6733
> URL: https://issues.apache.org/jira/browse/HUDI-6733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6733) Add flink-metrics-dropwizard to flink bundle

2023-08-21 Thread Danny Chen (Jira)
Danny Chen created HUDI-6733:


 Summary: Add flink-metrics-dropwizard to flink bundle
 Key: HUDI-6733
 URL: https://issues.apache.org/jira/browse/HUDI-6733
 Project: Apache Hudi
  Issue Type: Bug
  Components: flink
Reporter: Danny Chen
 Fix For: 0.14.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (ed5997348f5 -> e2d47605738)

2023-08-21 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from ed5997348f5 [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor 
Refactor & Added null Kafka Key test cases (#9459)
 add e2d47605738 [HUDI-6733] Add flink-metrics-dropwizard to flink bundle 
(#9499)

No new revisions were added by this update.

Summary of changes:
 packaging/hudi-flink-bundle/pom.xml | 1 +
 1 file changed, 1 insertion(+)



[GitHub] [hudi] danny0405 merged pull request #9499: [MINOR] Add flink-metrics-dropwizard to flink bundle

2023-08-21 Thread via GitHub


danny0405 merged PR #9499:
URL: https://github.com/apache/hudi/pull/9499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 opened a new pull request, #9499: [MINOR] Add flink-metrics-dropwizard to flink bundle

2023-08-21 Thread via GitHub


stream2000 opened a new pull request, #9499:
URL: https://github.com/apache/hudi/pull/9499

   ### Change Logs
   
   Add flink-metrics-dropwizard to the flink bundle to fix a ClassNotFoundException
   
   ### Impact
   
   Add flink-metrics-dropwizard to flink bundle
   
   ### Risk level (write none, low medium or high below)
   
   NONE
   
   ### Documentation Update
   
   NONE
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wecharyu commented on a diff in pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column

2023-08-21 Thread via GitHub


wecharyu commented on code in PR #9484:
URL: https://github.com/apache/hudi/pull/9484#discussion_r1300861013


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -483,24 +483,20 @@ abstract class HoodieBaseRelation(val sqlContext: 
SQLContext,
   protected def getPartitionColumnsAsInternalRowInternal(file: FileStatus, 
basePath: Path,
  
extractPartitionValuesFromPartitionPath: Boolean): InternalRow = {
 try {
-  val tableConfig = metaClient.getTableConfig
   if (extractPartitionValuesFromPartitionPath) {
 val tablePathWithoutScheme = 
CachingPath.getPathWithoutSchemeAndAuthority(basePath)
 val partitionPathWithoutScheme = 
CachingPath.getPathWithoutSchemeAndAuthority(file.getPath.getParent)
 val relativePath = new 
URI(tablePathWithoutScheme.toString).relativize(new 
URI(partitionPathWithoutScheme.toString)).toString
-val hiveStylePartitioningEnabled = 
tableConfig.getHiveStylePartitioningEnable.toBoolean
-if (hiveStylePartitioningEnabled) {
-  val partitionSpec = PartitioningUtils.parsePathFragment(relativePath)
-  
InternalRow.fromSeq(partitionColumns.map(partitionSpec(_)).map(UTF8String.fromString))
-} else {
-  if (partitionColumns.length == 1) {
-InternalRow.fromSeq(Seq(UTF8String.fromString(relativePath)))
-  } else {
-val parts = relativePath.split("/")
-assert(parts.size == partitionColumns.length)
-InternalRow.fromSeq(parts.map(UTF8String.fromString))
-  }
-}
+val timeZoneId = conf.get("timeZone", 
sparkSession.sessionState.conf.sessionLocalTimeZone)
+val rowValues = HoodieSparkUtils.parsePartitionColumnValues(

Review Comment:
   Addressed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on a diff in pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


boneanxs commented on code in PR #9472:
URL: https://github.com/apache/hudi/pull/9472#discussion_r1300858914


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.util.FileIOUtils.LOG;
+
+public class ReplaceCommitValidateUtil {
+  public static final String REPLACE_COMMIT_FILE_IDS = "replaceCommitFileIds";
+  public static void validateReplaceCommit(HoodieTableMetaClient metaClient) {
+    metaClient.reloadActiveTimeline();
+    Set<String> replaceFileids = new HashSet<>();
+
+    // Verify pending and completed replace commit
+    Stream.concat(metaClient.getActiveTimeline().getCompletedReplaceTimeline().getInstants().stream(),
+        metaClient.getActiveTimeline().filterInflights().filterPendingReplaceTimeline().getInstants().stream()).map(instant -> {
+      try {
+        HoodieReplaceCommitMetadata replaceCommitMetadata =
+            HoodieReplaceCommitMetadata.fromBytes(metaClient.getActiveTimeline().getInstantDetails(instant).get(),
+                HoodieReplaceCommitMetadata.class);
+        if (!instant.isCompleted()) {
+          return JsonUtils.getObjectMapper().readValue(
+              replaceCommitMetadata.getExtraMetadata().getOrDefault(REPLACE_COMMIT_FILE_IDS, "non-existent key"), String[].class);
+        } else {
+          return replaceCommitMetadata.getPartitionToReplaceFileIds().values().stream()
+              .flatMap(List::stream)
+              .toArray(String[]::new);
+        }
+      } catch (IOException e) {
+        // If the key does not exist or there is a JSON parsing error, LOG reports an error and ignores it.
+        LOG.error("Error when reading replace commit meta", e);
+        return null;
+      }
+    }).filter(Objects::nonNull)
+        .forEach(fileIdArray -> {
+          Arrays.stream(fileIdArray)
+              .filter(fileId -> !replaceFileids.add(fileId))
+              .findFirst()
+              .ifPresent(s -> {
+                throw new HoodieException("Replace commit involves duplicate file id!");

Review Comment:
   Can we make the exception message clearer?
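
For example, one possible wording that carries the conflicting file id and instant (an assumption about the message, not the PR's final text):

import org.apache.hudi.exception.HoodieException;

public class DuplicateFileIdMessageSketch {
  // Name the duplicate file id and the replace commit it was seen in, so the conflict
  // between concurrent clustering and delete-partition is obvious from the stack trace.
  static HoodieException duplicateFileId(String fileId, String instantTime) {
    return new HoodieException("Duplicate file id " + fileId + " found while validating replace commit "
        + instantTime + "; another pending/completed replace commit already replaces this file group");
  }
}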



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ReplaceCommitValidateUtil.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.cluster;
+
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static 

[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

2023-08-21 Thread via GitHub


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687328500

   > > @yyh2954360585 JDBC is slow and puts a lot of load on the source system, so a full query on a large table can cause high load or even downtime on the database server. You can set the value of source-limit according to your dataset and requirements. You can even set it to a very high value.
   > 
   > If I set source-limit=1000, then I can only extract 1000 rows from the source table, which is not reasonable because it has no offset.
   > 
   > 
https://github.com/apache/hudi/blob/ba5ab8ca46863a67023e7172fb16a9a36d3b5acb/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java#L239-L252
   
   So there will be another issue. If I have a table with 10 million rows and do not set the source limit, it will perform a full query on the source table. ppdQuery is a subquery; according to the SQL execution plan, the subquery executes first and then the outer layer. If jdbc.fetchsize is used, the fetchsize condition only applies at the outermost layer.
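
   To make the pattern concrete, here is a minimal sketch of the behavior being described (this is not the actual JdbcSource code; the class name, URL, table name, and option values are placeholders): Spark wraps the JDBC read in a subquery, `fetchsize` only controls how the result set is streamed back, and a bare `limit(sourceLimit)` without an offset or checkpoint column means rows beyond the limit are simply never ingested.
   
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class JdbcLimitSketch {
     // Hypothetical illustration, not JdbcSource itself; "big_table" and the option values are placeholders.
     static Dataset<Row> fullFetchWithLimit(SparkSession spark, String jdbcUrl, int sourceLimit) {
       return spark.read()
           .format("jdbc")
           .option("url", jdbcUrl)
           .option("dbtable", "big_table")
           .option("fetchsize", "10000")  // controls result-set streaming, not how much the database scans
           .load()
           .limit(sourceLimit);           // no offset/checkpoint column, so later rows are never read
     }
   }
   ```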


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

2023-08-21 Thread via GitHub


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687318170

   > > @yyh2954360585 JDBC is slow and puts a lot of load on the source system, so a full query on a large table can cause high load or even downtime on the database server. You can set the value of source-limit according to your dataset and requirements. You can even set it to a very high value.
   > 
   > If I set source-limit=1000, then I can only extract 1000 rows from the source table, which is not reasonable because it has no offset.
   
   
https://github.com/apache/hudi/blob/ba5ab8ca46863a67023e7172fb16a9a36d3b5acb/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java#L239-L252


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

2023-08-21 Thread via GitHub


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687317983

   
https://github.com/apache/hudi/blob/ba5ab8ca46863a67023e7172fb16a9a36d3b5acb/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java#L239-L252


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

2023-08-21 Thread via GitHub


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687316048

   
https://github.com/apache/hudi/blob/ba5ab8ca46863a67023e7172fb16a9a36d3b5acb/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java#L240


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zyclove commented on issue #9470: [SUPPORT] spark-sql hudi 0.12.3 Caused by: org.apache.avro.AvroTypeException: Found long, expecting union

2023-08-21 Thread via GitHub


zyclove commented on issue #9470:
URL: https://github.com/apache/hudi/issues/9470#issuecomment-1687306836

   @ad1happy2go 
   This table has been written with version 0.12.3, but a schema incompatibility problem occurred last time. Because it was a production issue, I cleaned up all the .hoodie directories but did not clean up the historical data files, and then restarted the task to restore it.
   
   Will there be any problem with this operation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #9494: [SUPPORT] Querying Hudi Table via Presto Hive Connector Errors out when having DecimalType Column.

2023-08-21 Thread via GitHub


danny0405 commented on issue #9494:
URL: https://github.com/apache/hudi/issues/9494#issuecomment-1687305837

   cc @codope Can you take a look ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

2023-08-21 Thread via GitHub


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687304692

   > @yyh2954360585 JDBC is slow and puts a lot of load on the source system, so a full query on a large table can cause high load or even downtime on the database server. You can set the value of source-limit according to your dataset and requirements. You can even set it to a very high value.
   
   If I set source-limit=1000, then I can only extract 1000 rows from the source table, which is not reasonable because it has no offset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9491: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop parti…

2023-08-21 Thread via GitHub


danny0405 commented on PR #9491:
URL: https://github.com/apache/hudi/pull/9491#issuecomment-1687296175

   Looks good, can you do a final review @boneanxs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file

2023-08-21 Thread via GitHub


beyond1920 commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1300814472


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -273,4 +280,31 @@ protected static Option 
toAvroRecord(HoodieRecord record, Schema
   return Option.empty();
 }
   }
+
+  protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback 
{
+// here we distinguish log files created from log files being appended. 
Considering following scenario:
+// An appending task write to log file.
+// (1) append to existing file file_instant_writetoken1.log.1
+// (2) rollover and create file file_instant_writetoken2.log.2
+// Then this task failed and retry by a new task.
+// (3) append to existing file file_instant_writetoken1.log.1
+// (4) rollover and create file file_instant_writetoken3.log.2
+// finally file_instant_writetoken2.log.2 should not be committed to hudi, 
we use marker file to delete it.
+// keep in mind that log file is not always fail-safe unless it never roll 
over
+

Review Comment:
   @nsivabalan @guanziyue I'm interested in this issue. Please count me in for the discussion on this topic; I might be able to provide some input and raise other problems related to this issue. My Slack id is Jing Zhang.
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file

2023-08-21 Thread via GitHub


beyond1920 commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1300814472


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -273,4 +280,31 @@ protected static Option 
toAvroRecord(HoodieRecord record, Schema
   return Option.empty();
 }
   }
+
+  protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback 
{
+// here we distinguish log files created from log files being appended. 
Considering following scenario:
+// An appending task write to log file.
+// (1) append to existing file file_instant_writetoken1.log.1
+// (2) rollover and create file file_instant_writetoken2.log.2
+// Then this task failed and retry by a new task.
+// (3) append to existing file file_instant_writetoken1.log.1
+// (4) rollover and create file file_instant_writetoken3.log.2
+// finally file_instant_writetoken2.log.2 should not be committed to hudi, 
we use marker file to delete it.
+// keep in mind that log file is not always fail-safe unless it never roll 
over
+

Review Comment:
   @nsivabalan @guanziyue I'm interested in this issue. Please count me in for 
the discussion on this topic; I might be able to provide some input. My slack 
id is Jing Zhang.
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9497: [HUDI-4756] Remove assume.date.partitioning configuration

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9497:
URL: https://github.com/apache/hudi/pull/9497#issuecomment-1687285081

   
   ## CI report:
   
   * 0707cfcf0cf82f7ce45159959b3378d7c89e6674 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19396)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [DOC] Fix the use of spark parameters in website documents (#9490)

2023-08-21 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 3acbd319605 [DOC] Fix the use of spark parameters in website documents 
(#9490)
3acbd319605 is described below

commit 3acbd319605d284fa289c75ff16d5a1a0684d5d3
Author: empcl <1515827...@qq.com>
AuthorDate: Tue Aug 22 09:36:57 2023 +0800

[DOC] Fix the use of spark parameters in website documents (#9490)

Co-authored-by: chenlei677 
---
 website/versioned_docs/version-0.10.1/table_management.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/versioned_docs/version-0.10.1/table_management.md 
b/website/versioned_docs/version-0.10.1/table_management.md
index b10b68cbf50..91372de0653 100644
--- a/website/versioned_docs/version-0.10.1/table_management.md
+++ b/website/versioned_docs/version-0.10.1/table_management.md
@@ -156,7 +156,7 @@ create table if not exists h3(
 options (
   primaryKey = 'id',
   type = 'mor',
-  ${hoodie.config.key1} = '${hoodie.config.value2}',
+  ${hoodie.config.key1} = '${hoodie.config.value1}',
   ${hoodie.config.key2} = '${hoodie.config.value2}',
   
 );



[GitHub] [hudi] danny0405 merged pull request #9490: fix the use of spark parameters in website documents

2023-08-21 Thread via GitHub


danny0405 merged PR #9490:
URL: https://github.com/apache/hudi/pull/9490


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1687279411

   
   ## CI report:
   
   * 1e493605d0a26b442efbf1518b063dbb1e616872 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19390)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19389)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19398)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] cbomgit opened a new issue, #9498: [SUPPORT] mergedSchema behavior when reading multiple hudi tables at once

2023-08-21 Thread via GitHub


cbomgit opened a new issue, #9498:
URL: https://github.com/apache/hudi/issues/9498

   Hello,
   
   I'm using Hudi 0.11 on Spark 3.2.1 on EMR 6.7.0. I have an ingestion pipeline where data is written out in daily batches and each batch is its own Hudi table. In other words, data is structured like so:
   
   ```
   basePath/
      2023-08-10/        /* this is a HUDI table */
         partitionColumn1/
            partitionColumn2/
      2023-08-11/        /* this is another HUDI table */
         partitionColumn1/
            partitionColumn2/
      ...
   ```
   
   The schema is common across all tables. We recently added a column to this schema and were trying to figure out the best way of handling this across our downstream data processing jobs, which typically read in a date range of data from the above tables.
   
   I found a solution utilizing the `mergeSchema` option. If I read all tables 
in like so, then my data is read in with the correct updated schema:
   
   ```
   def getHudiReadOptions(s3ReadPath: String): Map[String, String] = Map(
     DataSourceReadOptions.QUERY_TYPE.key() -> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL,
     DataSourceReadOptions.READ_PATHS.key() -> s3ReadPath,
     HoodieMetadataConfig.ENABLE.key() -> "true"
   )

   val paths = List(
     "s3://daily-data/2023-08-14/*/*/*", /* this path has the old schema */
     "s3://daily-data/2023-08-15/*/*/*"  /* this path has the new field */
   )
   val merged = spark.read.format("org.apache.hudi")
     .options(getHudiReadOptions(paths.mkString(",")))
     .option("mergeSchema", "true")
     .load()
   ```
   
   If I do not include the `mergeSchema` option, then the data is read in but is missing the new field. My question is: is this expected behavior? Can we rely on the mergeSchema option to handle these kinds of schema differences? I have read through the schema evolution documentation listed here (https://hudi.apache.org/docs/0.11.0/schema_evolution) but have not seen a mention of this option. Our use case is not typical, with multiple Hudi tables (one per day of data), so I wanted to check that this behavior is reliable.
   
   **Environment Description**
   
   * Hudi version : 0.11
   
   * Spark version : 3.2.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] majian1998 commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


majian1998 commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1687273434

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] cbomgit closed issue #9481: [SUPPORT] Potentially Incorret or incomplete Documentation

2023-08-21 Thread via GitHub


cbomgit closed issue #9481: [SUPPORT] Potentially Incorret or incomplete 
Documentation
URL: https://github.com/apache/hudi/issues/9481


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Jason-liujc commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-08-21 Thread via GitHub


Jason-liujc commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1687260803

   We are encountering the same issue. After using DynamoDB as the lock table, 
we still see this error: `java.util.ConcurrentModificationException: Cannot 
resolve conflicts for overlapping writes`
   
   What I observed:
   1. I have 4 EMR Spark clusters that write to the same table. One by one, they fail with the above error. When I look at the DynamoDB lock history, I see locks constantly getting created and released.
   2. The DynamoDB lock is not at the file level but at the table level, so two Hudi jobs might try to write to the same files and one of them fails. It seems that if a couple of concurrent jobs are writing to the same files at the same time, they go into some sort of failure storm, which might fail everything unless you set a really high retry threshold (typical lock-provider settings are sketched below).
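
   For reference, the table-level lock described above is usually wired up with settings along these lines. This is only a sketch, not a fix for the underlying write conflict; the table name, region, and retry values are placeholders:
   
   ```
   import java.util.HashMap;
   import java.util.Map;
   
   public class HudiDynamoDbLockOptions {
     // Sketch of OCC + DynamoDB lock writer options; values are illustrative only.
     static Map<String, String> occWriterOptions() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
       opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
       opts.put("hoodie.write.lock.provider", "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider");
       opts.put("hoodie.write.lock.dynamodb.table", "hudi_locks");          // placeholder
       opts.put("hoodie.write.lock.dynamodb.partition_key", "tablename");   // placeholder
       opts.put("hoodie.write.lock.dynamodb.region", "us-east-1");          // placeholder
       // The lock only serializes commits at the table level; overlapping file groups
       // are still rejected at conflict-resolution time, so generous retries help.
       opts.put("hoodie.write.lock.num_retries", "15");
       opts.put("hoodie.write.lock.wait_time_ms", "60000");
       return opts;
     }
   }
   ```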


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #9495: [SUPPORT] Writing Hudi tables with Flink fails with HFile exceptions

2023-08-21 Thread via GitHub


danny0405 commented on issue #9495:
URL: https://github.com/apache/hudi/issues/9495#issuecomment-1687260274

   The missing class is already in the bundle jar right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 merged pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases

2023-08-21 Thread via GitHub


danny0405 merged PR #9459:
URL: https://github.com/apache/hudi/pull/9459


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases (#9459)

2023-08-21 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new ed5997348f5 [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor 
Refactor & Added null Kafka Key test cases (#9459)
ed5997348f5 is described below

commit ed5997348f5284e107f0ca177241aa5ffc832f62
Author: Prathit malik <53890994+prathi...@users.noreply.github.com>
AuthorDate: Tue Aug 22 06:31:47 2023 +0530

[HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null 
Kafka Key test cases (#9459)
---
 .../hudi/utilities/sources/JsonKafkaSource.java|  2 +-
 .../utilities/sources/helpers/AvroConvertor.java   | 11 
 .../utilities/sources/TestAvroKafkaSource.java | 30 ++
 .../utilities/sources/TestJsonKafkaSource.java | 14 ++
 .../utilities/testutils/UtilitiesTestBase.java |  9 +++
 5 files changed, 60 insertions(+), 6 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
index f31c9b7e542..eb67abfee3a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
@@ -81,7 +81,7 @@ public class JsonKafkaSource extends KafkaSource {
 ObjectMapper om = new ObjectMapper();
 partitionIterator.forEachRemaining(consumerRecord -> {
   String recordValue = consumerRecord.value().toString();
-  String recordKey = consumerRecord.key().toString();
+  String recordKey = StringUtils.objToString(consumerRecord.key());
   try {
 ObjectNode jsonNode = (ObjectNode) om.readTree(recordValue);
 jsonNode.put(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
index 89191cb465c..f9c35bd3b6e 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
@@ -19,6 +19,7 @@
 package org.apache.hudi.utilities.sources.helpers;
 
 import org.apache.hudi.avro.MercifulJsonConverter;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
 
 import com.google.protobuf.Message;
@@ -171,16 +172,16 @@ public class AvroConvertor implements Serializable {
*/
   public GenericRecord withKafkaFieldsAppended(ConsumerRecord consumerRecord) {
 initSchema();
-GenericRecord record = (GenericRecord) consumerRecord.value();
+GenericRecord recordValue = (GenericRecord) consumerRecord.value();
 GenericRecordBuilder recordBuilder = new GenericRecordBuilder(this.schema);
-for (Schema.Field field :  record.getSchema().getFields()) {
-  recordBuilder.set(field, record.get(field.name()));
+for (Schema.Field field :  recordValue.getSchema().getFields()) {
+  recordBuilder.set(field, recordValue.get(field.name()));
 }
-
+String recordKey = StringUtils.objToString(consumerRecord.key());
 recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
 recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, 
consumerRecord.partition());
 recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, 
consumerRecord.timestamp());
-recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, 
consumerRecord.key().toString());
+recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, recordKey);
 return recordBuilder.build();
   }
 
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
index 2632f72659b..16ec4545665 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
@@ -62,6 +62,7 @@ import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SO
 import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN;
 import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_KEY_COLUMN;
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
 import static org.mockito.Mockito.mock;
 
 public class TestAvroKafkaSource extends SparkClientFunctionalTestHarness {
@@ -113,6 +114,17 @@ public class TestAvroKafkaSource extends 
SparkClientFunctionalTestHarness {
 }
   }
 
+  

[GitHub] [hudi] danny0405 commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases

2023-08-21 Thread via GitHub


danny0405 commented on PR #9459:
URL: https://github.com/apache/hudi/pull/9459#issuecomment-1687258178

   Tests have passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19345=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector

2023-08-21 Thread via GitHub


Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1687220558

   @danny0405 I looked at the JM logs and it looks like the same issue with `calcite/plan/RelOptRule` not being found. What's interesting is that even when I manually ADD JAR in my SQL Client, I still see this exception. Does the classpath for the compaction task differ from the job's at all?
   
   Stack trace
   
   ```
   2023-08-21 23:56:26.647 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 561 Deploying Source: 
Values[1] -> Calc[2] -> row_data_to_hoodie_record (1/1) (attempt #0) with 
attempt id 
1ce0c00a771c0214a3141bd885bcfdb0_cbc357ccb763df2852fee8c4fc7d55f2_0_0 and 
vertex id cbc357ccb763df2852fee8c4fc7d55f2_0 to 10.244.7.3:6122-d80422 @ 
10.244.7.3 (dataPort=40143) with allocation id 2786e9f15bcf024e1208268a4eb1a9f4
   2023-08-21 23:56:26.656 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 bucket_assigner 
(1/1) (1ce0c00a771c0214a3141bd885bcfdb0_e5df0a348bd9bf84b2755743802fc12d_0_0) 
switched from SCHEDULED to DEPLOYING.
   2023-08-21 23:56:26.656 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 561 Deploying 
bucket_assigner (1/1) (attempt #0) with attempt id 
1ce0c00a771c0214a3141bd885bcfdb0_e5df0a348bd9bf84b2755743802fc12d_0_0 and 
vertex id e5df0a348bd9bf84b2755743802fc12d_0 to 10.244.7.3:6122-d80422 @ 
10.244.7.3 (dataPort=40143) with allocation id 2786e9f15bcf024e1208268a4eb1a9f4
   2023-08-21 23:56:26.656 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 stream_write: mor 
(1/1) (1ce0c00a771c0214a3141bd885bcfdb0_de00fca8057a992b9689a76776a87e88_0_0) 
switched from SCHEDULED to DEPLOYING.
   2023-08-21 23:56:26.657 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 561 Deploying 
stream_write: mor (1/1) (attempt #0) with attempt id 
1ce0c00a771c0214a3141bd885bcfdb0_de00fca8057a992b9689a76776a87e88_0_0 and 
vertex id de00fca8057a992b9689a76776a87e88_0 to 10.244.7.3:6122-d80422 @ 
10.244.7.3 (dataPort=40143) with allocation id 2786e9f15bcf024e1208268a4eb1a9f4
   2023-08-21 23:56:26.657 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 
compact_plan_generate (1/1) 
(1ce0c00a771c0214a3141bd885bcfdb0_454d170138cbd567f9c11b29e690b935_0_0) 
switched from SCHEDULED to DEPLOYING.
   2023-08-21 23:56:26.657 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 561 Deploying 
compact_plan_generate (1/1) (attempt #0) with attempt id 
1ce0c00a771c0214a3141bd885bcfdb0_454d170138cbd567f9c11b29e690b935_0_0 and 
vertex id 454d170138cbd567f9c11b29e690b935_0 to 10.244.7.3:6122-d80422 @ 
10.244.7.3 (dataPort=40143) with allocation id 2786e9f15bcf024e1208268a4eb1a9f4
   2023-08-21 23:56:26.657 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 compact_task -> 
Sink: compact_commit (1/1) 
(1ce0c00a771c0214a3141bd885bcfdb0_848dfd76319f29eb5fbacd6c182b1d35_0_0) 
switched from SCHEDULED to DEPLOYING.
   2023-08-21 23:56:26.657 [] flink-akka.actor.default-dispatcher-19 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 561 Deploying 
compact_task -> Sink: compact_commit (1/1) (attempt #0) with attempt id 
1ce0c00a771c0214a3141bd885bcfdb0_848dfd76319f29eb5fbacd6c182b1d35_0_0 and 
vertex id 848dfd76319f29eb5fbacd6c182b1d35_0 to 10.244.7.3:6122-d80422 @ 
10.244.7.3 (dataPort=40143) with allocation id 2786e9f15bcf024e1208268a4eb1a9f4
   2023-08-21 23:56:28.173 [] flink-akka.actor.default-dispatcher-15 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 stream_write: mor 
(1/1) (1ce0c00a771c0214a3141bd885bcfdb0_de00fca8057a992b9689a76776a87e88_0_0) 
switched from DEPLOYING to INITIALIZING.
   2023-08-21 23:56:28.173 [] flink-akka.actor.default-dispatcher-15 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 compact_task -> 
Sink: compact_commit (1/1) 
(1ce0c00a771c0214a3141bd885bcfdb0_848dfd76319f29eb5fbacd6c182b1d35_0_0) 
switched from DEPLOYING to INITIALIZING.
   2023-08-21 23:56:28.187 [] flink-akka.actor.default-dispatcher-14 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 
compact_plan_generate (1/1) 
(1ce0c00a771c0214a3141bd885bcfdb0_454d170138cbd567f9c11b29e690b935_0_0) 
switched from DEPLOYING to INITIALIZING.
   2023-08-21 23:56:28.187 [] flink-akka.actor.default-dispatcher-14 INFO  
flink apache.flink.runtime.executiongraph.ExecutionGraph 1435 Source: Values[1] 
-> Calc[2] -> row_data_to_hoodie_record (1/1) 
(1ce0c00a771c0214a3141bd885bcfdb0_cbc357ccb763df2852fee8c4fc7d55f2_0_0) 
switched from DEPLOYING to INITIALIZING.
   2023-08-21 23:56:28.188 [] 

[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757157#comment-17757157
 ] 

Lin Liu commented on HUDI-6712:
---

Based on these PRs, e.g., https://github.com/onehouseinc/lake-plumber/pull/7, 
https://github.com/onehouseinc/lake-plumber/pull/5, 
https://github.com/onehouseinc/lake-plumber/pull/8, 
https://github.com/onehouseinc/lake-plumber/pull/9, etc., I will update the 
corresponding logic and run experiments.

> Implement optimized keyed lookup on parquet files
> -
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records based 
> on a single key lookup column. 
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List lookup(parquetFile, Set keys) (code) 
> Let's implement a reader that is optimized for this pattern, by scanning the 
> least amount of data. 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to 
> minimize the amount of data read. 
> 5. Must make the minimum number of RPC calls to cloud storage.
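
As a rough illustration of requirement 4, row groups can be pruned with footer min/max statistics before any data pages are read. This is only a sketch, not the planned implementation; it assumes a string key column, and the class and method names below are hypothetical:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.api.Binary;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep only the row groups whose [min, max] range on the
// key column can contain one of the requested keys; everything else is skipped.
public class KeyedLookupSketch {
  static List<BlockMetaData> candidateRowGroups(String parquetFile, String keyColumn,
      Set<String> keys, Configuration conf) throws IOException {
    List<BlockMetaData> candidates = new ArrayList<>();
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile), conf))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          if (!column.getPath().toDotString().equals(keyColumn)) {
            continue;
          }
          if (column.getStatistics() == null || !column.getStatistics().hasNonNullValue()) {
            candidates.add(block); // no stats: cannot prune, must read this row group
            continue;
          }
          Binary min = (Binary) column.getStatistics().genericGetMin();
          Binary max = (Binary) column.getStatistics().genericGetMax();
          boolean overlaps = keys.stream().anyMatch(k -> {
            Binary key = Binary.fromString(k);
            return key.compareTo(min) >= 0 && key.compareTo(max) <= 0;
          });
          if (overlaps) {
            candidates.add(block);
          }
        }
      }
    }
    return candidates;
  }
}
```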



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6731) Allow MoR Read-Optimized BigQuery Sync

2023-08-21 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown reassigned HUDI-6731:
---

Assignee: Timothy Brown

> Allow MoR Read-Optimized BigQuery Sync
> --
>
> Key: HUDI-6731
> URL: https://issues.apache.org/jira/browse/HUDI-6731
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Minor
>  Labels: pull-request-available
>
> Allow users to query their Hudi MoR tables with BigQuery in a read-optimized 
> manner by syncing the base files to BigQuery like we do for CoW tables today.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9497: [HUDI-4756] Remote assume.date.partitioning configuration

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9497:
URL: https://github.com/apache/hudi/pull/9497#issuecomment-1687165968

   
   ## CI report:
   
   * 0707cfcf0cf82f7ce45159959b3378d7c89e6674 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19396)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9497: [HUDI-4756] Remote assume.date.partitioning configuration

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9497:
URL: https://github.com/apache/hudi/pull/9497#issuecomment-1687158189

   
   ## CI report:
   
   * 0707cfcf0cf82f7ce45159959b3378d7c89e6674 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code opened a new pull request, #9497: Remote date partition

2023-08-21 Thread via GitHub


linliu-code opened a new pull request, #9497:
URL: https://github.com/apache/hudi/pull/9497

   ### Change Logs
   
   Remove the assume.date.partitioning config from Hudi.
   
   ### Impact
   
   Hudi can no longer rely on this configuration for optimizations, but removing it simplifies the codebase and drops the special handling logic for partition creation.
   
   ### Risk level (write none, low medium or high below)
   
   After we confirm that no tables are using this config, no users should be 
impacted.
   
   ### Documentation Update
   
   Old documentation about this config should be removed.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code opened a new pull request, #9496: [HUDI-4756] remove unused config "hudi.assume.date.partitioning"

2023-08-21 Thread via GitHub


linliu-code opened a new pull request, #9496:
URL: https://github.com/apache/hudi/pull/9496

   ### Change Logs
   
   Removed the code that supports the "assume.date.partition" configuration.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   After we confirm that no tables use this config, the risk should be low or none.
   
   ### Documentation Update
   
   We should remove the configuration from Hudi documentation.
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code closed pull request #9496: [HUDI-4756] remove unused config "hudi.assume.date.partitioning"

2023-08-21 Thread via GitHub


linliu-code closed pull request #9496: [HUDI-4756] remove unused config 
"hudi.assume.date.partitioning"
URL: https://github.com/apache/hudi/pull/9496


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code closed pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"

2023-08-21 Thread via GitHub


linliu-code closed pull request #9466: [HUDI-4756] Remove unused config 
"hoodie.assume.date.partitioning"
URL: https://github.com/apache/hudi/pull/9466


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9488:
URL: https://github.com/apache/hudi/pull/9488#issuecomment-1687091730

   
   ## CI report:
   
   * f2619c32ef7543b508b830ebeae2e6d4c1bae591 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19394)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9466:
URL: https://github.com/apache/hudi/pull/9466#issuecomment-1687091632

   
   ## CI report:
   
   * d61eae7b243d92629914d2b95637922db6be3b08 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19337)
 
   * 0707cfcf0cf82f7ce45159959b3378d7c89e6674 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Riddle4045 opened a new issue, #9495: [SUPPORT] Writing Hudi tables with Flink fails with HFile exceptions

2023-08-21 Thread via GitHub


Riddle4045 opened a new issue, #9495:
URL: https://github.com/apache/hudi/issues/9495

   Using Flink version : 1.16
   Trying to write a hudi table with metadata sync enabled to HMS,  sample 
DStream API job 
   
   ```
   final StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
   // generate a source of data.
   DataStream fares = env.addSource(new 
TaxiFareGenerator()).map(
   event -> GenericRowData.of(
   event.getRideId(),
   event.getDriverId(),
   event.getTaxiId(),
   event.getStartTime(),
   event.getTip(),
   event.getTolls(),
   event.getTotalFare()//,
   //event.getPaymentType()
   ));
   
   String targetTable = "TaxiFare";
   String outputPath = String.join("/",basePath, "hudi_mor_2023082101");
   Map<String, String> options = new HashMap<>();
   
   options.put(FlinkOptions.PATH.key(), outputPath);
   options.put(FlinkOptions.TABLE_TYPE.key(), 
HoodieTableType.MERGE_ON_READ.name());
   options.put(FlinkOptions.METADATA_ENABLED.key(), "true");
   options.put(FlinkOptions.HIVE_SYNC_ENABLED.key(), "true");
   options.put(FlinkOptions.HIVE_SYNC_MODE.key(), 
HiveSyncMode.HMS.name());
   options.put(FlinkOptions.HIVE_SYNC_METASTORE_URIS.key(), 
"thrift://hive-metastore:9083 ");
   
   HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable)
   .column("rideId BIGINT")
   .column("driverId BIGINT")
   .column("taxiId BIGINT")
   .column("startTime BIGINT")
   .column("tip FLOAT")
   .column("tolls FLOAT")
   .column("totalFare FLOAT")
   .pk("driverId")
   .options(options);
   
   builder.sink(fares, false);
   env.execute("Hudi Table with HMS");
   ```
   
   
   The job fails with the following exceptions 
   
   ```
   2023-08-21 14:30:42
   org.apache.flink.util.FlinkException: Global failure triggered by 
OperatorCoordinator for 'stream_write: TaxiFare' (operator 
07008715da12a894ffb19d48ffca33da).
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:617)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:190)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
at java.base/java.lang.Thread.run(Unknown Source)
   Caused by: org.apache.hudi.exception.HoodieException: Executor executes 
action [initialize instant 20230821213020770] error
... 6 more
   Caused by: org.apache.hudi.exception.HoodieException: Failed to update 
metadata
at 
org.apache.hudi.client.HoodieFlinkTableServiceClient.writeTableMetadata(HoodieFlinkTableServiceClient.java:181)
at 
org.apache.hudi.client.HoodieFlinkWriteClient.writeTableMetadata(HoodieFlinkWriteClient.java:279)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:282)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:233)
at 
org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:111)
at 
org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:74)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:199)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:537)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:513)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:484)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$initInstant$6(StreamWriteOperatorCoordinator.java:402)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
... 3 more
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upsetting 
bucketType UPDATE for partition :files
at 
org.apache.hudi.table.action.commit.BaseFlinkCommitActionExecutor.handleUpsertPartition(BaseFlinkCommitActionExecutor.java:203)
at 

[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1687074168

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * aa45782a86a9632784fa95c4c0cda62c6245bcb6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19393)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector

2023-08-21 Thread via GitHub


Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1687040392

   @danny0405  sorry for the late response, here is the DAG. From the DAG it 
looks like compaction is scheduled.
   An interesting thing to note here is that I am running into the same issue with COW table Flink writes too; Trino returns an empty result when querying them.
   
![image](https://github.com/apache/hudi/assets/3648351/fe512c10-36cb-4221-b33e-591710b776a5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1687011964

   
   ## CI report:
   
   * 20773c0c6c03fa355e200867ab3dfc7cfc36d456 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19392)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9467:
URL: https://github.com/apache/hudi/pull/9467#issuecomment-1686918146

   
   ## CI report:
   
   * 6cba53453621aa518273a8879eeea3081f4c94cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19384)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] guanziyue commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file

2023-08-21 Thread via GitHub


guanziyue commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1300551764


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -273,4 +280,31 @@ protected static Option 
toAvroRecord(HoodieRecord record, Schema
   return Option.empty();
 }
   }
+
+  protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback 
{
+// here we distinguish log files created from log files being appended. 
Considering following scenario:
+// An appending task write to log file.
+// (1) append to existing file file_instant_writetoken1.log.1
+// (2) rollover and create file file_instant_writetoken2.log.2
+// Then this task failed and retry by a new task.
+// (3) append to existing file file_instant_writetoken1.log.1
+// (4) rollover and create file file_instant_writetoken3.log.2
+// finally file_instant_writetoken2.log.2 should not be committed to hudi, 
we use marker file to delete it.
+// keep in mind that log file is not always fail-safe unless it never roll 
over
+

Review Comment:
   > hey @guanziyue: what's your Slack ID? Let's chat directly in the Hudi workspace to get consensus on a resolution faster.
   
   Sent you a DM on Slack ~ my ID is Ziyue Guan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9488:
URL: https://github.com/apache/hudi/pull/9488#issuecomment-1686850438

   
   ## CI report:
   
   * 47cd0ae5da72d7d6e8361cf2af2893690c63a518 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19382)
 
   * f2619c32ef7543b508b830ebeae2e6d4c1bae591 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19394)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1686838897

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * 4b08445a6f501d2004a6dfab5a9daae7e769cf23 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19391)
 
   * aa45782a86a9632784fa95c4c0cda62c6245bcb6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19393)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9488:
URL: https://github.com/apache/hudi/pull/9488#issuecomment-1686839177

   
   ## CI report:
   
   * 47cd0ae5da72d7d6e8361cf2af2893690c63a518 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19382)
 
   * f2619c32ef7543b508b830ebeae2e6d4c1bae591 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rahil-c opened a new issue, #9494: [SUPPORT] Querying Hudi Table via Presto Hive Connector Errors out when having DecimalType Column.

2023-08-21 Thread via GitHub


rahil-c opened a new issue, #9494:
URL: https://github.com/apache/hudi/issues/9494

   ## Issue Summary 
   
   
   Versions of apps
   * Hudi 0.13.1
   * Presto 0.281
   * Spark 3.4.0
   * glue enabled 
   
   
   * When using EMR 6.12.0 with Spark to create the Hudi table and Presto to query it, with AWS Glue as the catalog, I hit the issue below. Creating a regular Parquet table with a decimal type column and querying it works fine, whereas with the Hudi format I see the issue (not sure if it is truly Presto related or Hudi related).
   
   Config JSON for enabling Glue on Spark and Presto when creating the EMR cluster:
   ```
   [
     {
       "Classification": "spark-hive-site",
       "Properties": {
         "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
       }
     },
     {
       "Classification": "hive-site",
       "Properties": {
         "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
         "hive.metastore.schema.verification": "false"
       }
     },
     {
       "Classification": "presto-connector-hive",
       "Properties": {
         "hive.metastore": "glue"
       }
     }
   ]
   ```
   
   Exception/Stack Trace
   ```
   
   java.lang.UnsupportedOperationException: com.facebook.presto.common.type.ShortDecimalType
       at com.facebook.presto.common.type.AbstractType.writeSlice(AbstractType.java:146)
       at com.facebook.presto.parquet.reader.BinaryColumnReader.readValue(BinaryColumnReader.java:55)
       at com.facebook.presto.parquet.reader.AbstractColumnReader.lambda$readValues$0(AbstractColumnReader.java:169)
       at com.facebook.presto.parquet.reader.AbstractColumnReader.processValues(AbstractColumnReader.java:223)
       at com.facebook.presto.parquet.reader.AbstractColumnReader.readValues(AbstractColumnReader.java:168)
       at com.facebook.presto.parquet.reader.AbstractColumnReader.readNext(AbstractColumnReader.java:148)
       at com.facebook.presto.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:390)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:557)
       at com.facebook.presto.parquet.reader.ParquetReader.readBlock(ParquetReader.java:540)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:230)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:208)
       at com.facebook.presto.common.block.LazyBlock.assureLoaded(LazyBlock.java:313)
       at com.facebook.presto.common.block.LazyBlock.getLoadedBlock(LazyBlock.java:304)
       at com.facebook.presto.common.Page.getLoadedPage(Page.java:314)
       at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:282)
       at com.facebook.presto.operator.Driver.processInternal(Driver.java:468)
       at com.facebook.presto.operator.Driver.lambda$processFor$10(Driver.java:333)
       at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:782)
       at com.facebook.presto.operator.Driver.processFor(Driver.java:326)
       at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1097)
       at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
       at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:603)
       at com.facebook.presto.$gen.Presto_0_281_amzn_020230814_195449_1.run(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:750)
   
   
   ```
   
   *  I see a github issue thread here 
https://github.com/prestodb/presto/issues/12016 for a similar issue
   
   
   
   ## Repro Steps
   
   1. Create hudi table via EMR Spark 
   
   ```
   
   start pyspark:
   
   pyspark --jars /usr/lib/hudi/hudi-spark-bundle.jar \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
     --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
   
   from pyspark.sql.types import *
   from decimal import Decimal
   
   test_data = [
       ("100", "2015-01-01", Decimal(1123231231.12)),
       ("101", "2015-01-01", Decimal(1123231231.12)),
       ("102", "2015-01-01", Decimal(1123231231.12)),
       ("103", "2015-01-01", Decimal(1123231231.12)),
       ("104", "2015-01-02", Decimal(1123231231.12)),
       ("105", "2015-01-02", Decimal(1123231231.12)),
   ]
   
   schema = StructType(
       [
           StructField("id", StringType()),
   

[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1686822578

   
   ## CI report:
   
   * 1e493605d0a26b442efbf1518b063dbb1e616872 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19390)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19389)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution

2023-08-21 Thread via GitHub


the-other-tim-brown commented on code in PR #9482:
URL: https://github.com/apache/hudi/pull/9482#discussion_r1300464593


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java:
##
@@ -147,6 +159,31 @@ public void createManifestTable(String tableName, String sourceUri) {
 }
   }
 
+  /**
+   * Updates the schema for the given table if the schema has changed.
+   * @param tableName name of the table in BigQuery
+   * @param schema latest schema for the table
+   */
+  public void updateTableSchema(String tableName, Schema schema, List<String> partitionFields) {
+    Table existingTable = bigquery.getTable(TableId.of(projectId, datasetName, tableName));
+    ExternalTableDefinition definition = existingTable.getDefinition();
+    Schema remoteTableSchema = definition.getSchema();
+    // Add the partition fields into the schema to avoid conflicts while updating
+    List<Field> updatedTableFields = remoteTableSchema.getFields().stream()
+        .filter(field -> partitionFields.contains(field.getName()))
+        .collect(Collectors.toList());
+    updatedTableFields.addAll(schema.getFields());
+    Schema finalSchema = Schema.of(updatedTableFields);
+    if (definition.getSchema() != null && definition.getSchema().equals(finalSchema)) {
+      return; // No need to update schema.
+    }
+    Table updatedTable = existingTable.toBuilder()

Review Comment:
   Yes, when updating the schema, BigQuery expects all fields including the 
partition fields to be present. 
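
   For illustration only (not the exact code from this PR, and the diff above is cut off at the builder call): a minimal sketch of how such an update is typically completed with the google-cloud-bigquery client, assuming a merged `finalSchema` like the one built in the snippet above.
   
   ```
   import com.google.cloud.bigquery.BigQuery;
   import com.google.cloud.bigquery.ExternalTableDefinition;
   import com.google.cloud.bigquery.Schema;
   import com.google.cloud.bigquery.Table;
   import com.google.cloud.bigquery.TableId;
   
   public class BigQuerySchemaUpdateSketch {
     // Pushes a merged schema (data fields + partition fields) to an existing external table.
     static void applySchema(BigQuery bigquery, TableId tableId, Schema finalSchema) {
       Table existingTable = bigquery.getTable(tableId);
       ExternalTableDefinition definition = existingTable.getDefinition();
       // Rebuild the definition with the merged schema and send the update.
       Table updatedTable = existingTable.toBuilder()
           .setDefinition(definition.toBuilder().setSchema(finalSchema).build())
           .build();
       bigquery.update(updatedTable);
     }
   }
   ```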



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1686773173

   
   ## CI report:
   
   * 8878d3d489b39b5698fe8a83acc7f9aa313b6a98 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19359)
 
   * 20773c0c6c03fa355e200867ab3dfc7cfc36d456 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19392)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1686773107

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * f7c426b7906a2d64aaabbe96c6d9a011ab9b441a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19357)
 
   * 4b08445a6f501d2004a6dfab5a9daae7e769cf23 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19391)
 
   * aa45782a86a9632784fa95c4c0cda62c6245bcb6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rmahindra123 commented on a diff in pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


rmahindra123 commented on code in PR #9488:
URL: https://github.com/apache/hudi/pull/9488#discussion_r1300458207


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -128,6 +128,12 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .withDocumentation("Assume standard yyyy/mm/dd partitioning, this"
           + " exists to support backward compatibility. If you use hoodie 0.3.x, do not set this parameter");
 
+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_ALLOW_READ_OPTIMIZED_SYNC = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.allow_read_optimized_sync")
+      .defaultValue(false)

Review Comment:
   Yeah, I agree, we should just call it out in the docs. Otherwise LGTM. 
@codope you can help merge when ready



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinishjail97 commented on a diff in pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution

2023-08-21 Thread via GitHub


vinishjail97 commented on code in PR #9482:
URL: https://github.com/apache/hudi/pull/9482#discussion_r1300451056


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java:
##
@@ -147,6 +159,31 @@ public void createManifestTable(String tableName, String sourceUri) {
 }
   }
 
+  /**
+   * Updates the schema for the given table if the schema has changed.
+   * @param tableName name of the table in BigQuery
+   * @param schema latest schema for the table
+   */
+  public void updateTableSchema(String tableName, Schema schema, List<String> partitionFields) {
+    Table existingTable = bigquery.getTable(TableId.of(projectId, datasetName, tableName));
+    ExternalTableDefinition definition = existingTable.getDefinition();
+    Schema remoteTableSchema = definition.getSchema();
+    // Add the partition fields into the schema to avoid conflicts while updating
+    List<Field> updatedTableFields = remoteTableSchema.getFields().stream()
+        .filter(field -> partitionFields.contains(field.getName()))
+        .collect(Collectors.toList());
+    updatedTableFields.addAll(schema.getFields());
+    Schema finalSchema = Schema.of(updatedTableFields);
+    if (definition.getSchema() != null && definition.getSchema().equals(finalSchema)) {
+      return; // No need to update schema.
+    }
+    Table updatedTable = existingTable.toBuilder()

Review Comment:
   Clarification: we are creating the table by providing a schema without 
partition fields, but updating it with a schema that adds the partition 
fields. Are we sure that is the right thing? Without manifests and using 
ExternalTableDefinition, we use the schema without partition fields. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


jonvex commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1300455535


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -237,24 +232,27 @@ private void 
testBaseFileAndLogFileUpdateMatchesHelper(Boolean shouldAsyncCompac
 doWrite(updatedRecord);
 if (shouldRollback) {
   deleteLatestDeltacommit();
+  enableInlineCompaction(shouldInlineCompact);

Review Comment:
   It's because we roll back the 3rd commit, so we want it to compact on the 
rewritten 3rd commit; that's why we enable it just before



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


jonvex commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1300454813


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieSparkClientTestBase;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.common.testutils.RawTripTestPayload.recordToString;
+import static 
org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS;
+import static org.apache.spark.sql.SaveMode.Append;
+import static org.apache.spark.sql.SaveMode.Overwrite;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+/**
+ * Test mor with colstats enabled in scenarios to ensure that files
+ * are being appropriately read or not read.
+ * The strategy employed is to corrupt targeted base files. If we want
+ * to prove the file is read, we assert that an exception will be thrown.
+ * If we want to prove the file is not read, we expect the read to
+ * successfully execute.
+ */
+public class TestMORColstats extends HoodieSparkClientTestBase {
+
+  private static String matchCond = "trip_type = 'UBERX'";
+  private static String nonMatchCond = "trip_type = 'BLACK'";
+  private static String[] dropColumns = {"_hoodie_commit_time", 
"_hoodie_commit_seqno",
+  "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};
+
+  private Boolean shouldOverwrite;
+  Map options;
+  @TempDir
+  public java.nio.file.Path basePath;
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initSparkContexts();
+dataGen = new HoodieTestDataGenerator();
+shouldOverwrite = true;
+options = getOptions();
+Properties props = new Properties();
+props.putAll(options);
+try {
+  metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, 
basePath.toString(), props);
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+cleanupSparkContexts();
+cleanupTestDataGenerator();
+metaClient = null;
+  }
+
+  /**
+   * Create two files, one should be excluded by colstats
+   */
+  @Test
+  public void testBaseFileOnly() {
+Dataset inserts = makeInsertDf("000", 100);
+Dataset batch1 = inserts.where(matchCond);
+Dataset batch2 = inserts.where(nonMatchCond);
+doWrite(batch1);
+doWrite(batch2);
+List filesToCorrupt = getFilesToCorrupt();
+assertEquals(1, filesToCorrupt.size());
+filesToCorrupt.forEach(TestMORColstats::corruptFile);
+assertEquals(0, readMatchingRecords().except(batch1).count());
+//Read without data skipping to show that it will fail
+//Reading with data skipping succeeded so that means that data skipping is 
working and the corrupted
+//file was 

[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1686762253

   
   ## CI report:
   
   * 8878d3d489b39b5698fe8a83acc7f9aa313b6a98 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19359)
 
   * 20773c0c6c03fa355e200867ab3dfc7cfc36d456 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


the-other-tim-brown commented on code in PR #9488:
URL: https://github.com/apache/hudi/pull/9488#discussion_r1300449186


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -128,6 +128,12 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .withDocumentation("Assume standard yyyy/mm/dd partitioning, this"
           + " exists to support backward compatibility. If you use hoodie 0.3.x, do not set this parameter");
 
+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_ALLOW_READ_OPTIMIZED_SYNC = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.allow_read_optimized_sync")
+      .defaultValue(false)

Review Comment:
   Ok sounds good, I have a todo on my list for this week to go through the 
docs and make sure they're updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


codope commented on code in PR #9488:
URL: https://github.com/apache/hudi/pull/9488#discussion_r1300434964


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -128,6 +128,12 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .withDocumentation("Assume standard yyyy/mm/dd partitioning, this"
           + " exists to support backward compatibility. If you use hoodie 0.3.x, do not set this parameter");
 
+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_ALLOW_READ_OPTIMIZED_SYNC = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.allow_read_optimized_sync")
+      .defaultValue(false)

Review Comment:
   Let's remove the flag. Don't even need to log a warning. We can call out in 
the docs that RO queries will be supported for BQ.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ehurheap closed issue #9480: [SUPPORT] create savepoint fails with OutOfMemoryError

2023-08-21 Thread via GitHub


ehurheap closed issue #9480: [SUPPORT] create savepoint fails with 
OutOfMemoryError
URL: https://github.com/apache/hudi/issues/9480


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9488: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables

2023-08-21 Thread via GitHub


the-other-tim-brown commented on code in PR #9488:
URL: https://github.com/apache/hudi/pull/9488#discussion_r1300367026


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -128,6 +128,12 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .withDocumentation("Assume standard yyyy/mm/dd partitioning, this"
           + " exists to support backward compatibility. If you use hoodie 0.3.x, do not set this parameter");
 
+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_ALLOW_READ_OPTIMIZED_SYNC = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.allow_read_optimized_sync")
+      .defaultValue(false)

Review Comment:
   @codope do we even need this flag or will it be enough to just log some 
warning and update documentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757003#comment-17757003
 ] 

Lin Liu commented on HUDI-6712:
---

[~rmahindra] sent some PRs to review for the context. Will finish reading them 
and start writing the design doc today.

> Implement optimized keyed lookup on parquet files
> -
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records based 
> on a single key lookup column. 
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List lookup(parquetFile, Set keys) (code) 
> Let's implement a reader that is optimized for this pattern, by scanning 
> the least amount of data. 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to 
> minimize the data read. 
> 5. Must do the minimum amount of RPC calls to cloud storage.
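
One way to approximate this pattern with plain parquet-mr today (a hedged sketch under assumptions, not the reader this issue proposes): push the key set down as a filter predicate so row-group statistics, dictionaries and bloom filters can skip data. The class and field names below are illustrative.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.api.Binary;

public class ParquetKeyedLookupSketch {
  // Returns every record whose keyField value is in keys (multiple values per key are kept).
  static List<GenericRecord> lookup(Path parquetFile, String keyField, Set<String> keys) throws IOException {
    List<GenericRecord> result = new ArrayList<>();
    if (keys.isEmpty()) {
      return result;
    }
    // Build key = k1 OR key = k2 OR ... so parquet-mr can prune row groups and pages.
    FilterPredicate predicate = null;
    for (String key : keys) {
      FilterPredicate eq = FilterApi.eq(FilterApi.binaryColumn(keyField), Binary.fromString(key));
      predicate = (predicate == null) ? eq : FilterApi.or(predicate, eq);
    }
    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(HadoopInputFile.of(parquetFile, new Configuration()))
        .withFilter(FilterCompat.get(predicate))
        .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        result.add(record);
      }
    }
    return result;
  }
}
{code}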



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757001#comment-17757001
 ] 

Lin Liu commented on HUDI-6701:
---

Will discuss the next step in today's sync up.

> Explore use of UUID-6/7 as a replacement for current auto generated keys
> 
>
> Key: HUDI-6701
> URL: https://issues.apache.org/jira/browse/HUDI-6701
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Today, we auto generate string keys of the form 
> (HoodieRecord#generateSequenceId), which is highly compressible, esp compared 
> to uuidv1, when we store as a string column inside a parquet file.
> {code:java}
>   public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
>     return instantTime + "_" + partitionId + "_" + recordIndex;
>   }
> {code}
> As a part of this task, we'd love to understand if 
> - Can uuid6 or 7, provide similar compressed storage footprint when written 
> as a column in a parquet file. 
> - can the current format be represented as a 160-bit number i.e 2 longs, 1 
> int in storage? would that save us further in storage costs?  
> (Orthogonal consideration is the memory needed to hold the key string, which 
> can be higher than a 160bits. We can discuss this later, once we understand 
> storage footprint) 
>  
> Resources:
> * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ 
> * https://github.com/uuid6/uuid6-ietf-draft
> * https://github.com/uuid6/prototypes 
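
To make the 160-bit question above concrete, a rough sketch (assumption: instant times are the usual 17-digit yyyyMMddHHmmssSSS strings, which fit in a signed long; names are illustrative):

{code:java}
import java.nio.ByteBuffer;

public class CompactRecordKeySketch {
  // Packs the three components of the auto-generated key into 20 bytes (160 bits):
  // 8-byte instant time, 4-byte partition id, 8-byte record index.
  static byte[] pack(String instantTime, int partitionId, long recordIndex) {
    return ByteBuffer.allocate(20)
        .putLong(Long.parseLong(instantTime)) // e.g. "20230821153000123"
        .putInt(partitionId)
        .putLong(recordIndex)
        .array();
  }

  // Reconstructs the current string form instantTime_partitionId_recordIndex.
  static String unpack(byte[] key) {
    ByteBuffer buf = ByteBuffer.wrap(key);
    return buf.getLong() + "_" + buf.getInt() + "_" + buf.getLong();
  }
}
{code}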



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757002#comment-17757002
 ] 

Lin Liu commented on HUDI-4756:
---

Today I will fix all the test failures and send for review again.

> Clean up usages of "assume.date.partition" config within hudi
> -
>
> Key: HUDI-4756
> URL: https://issues.apache.org/jira/browse/HUDI-4756
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: sivabalan narayanan
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> looks like "assume.date.partition" is not used anywhere within hudi. lets 
> clean up the usages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1686583717

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * f7c426b7906a2d64aaabbe96c6d9a011ab9b441a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19357)
 
   * 4b08445a6f501d2004a6dfab5a9daae7e769cf23 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19391)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1686569043

   
   ## CI report:
   
   * 42d026cd694d6368e45b058a4ff7a9bd36b0d3a2 UNKNOWN
   * f7c426b7906a2d64aaabbe96c6d9a011ab9b441a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19357)
 
   * 4b08445a6f501d2004a6dfab5a9daae7e769cf23 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Prasagnya commented on issue #9493: Hudi DELETE_PARTITIONS operation doesn't delete partitions via Spark Data Frame

2023-08-21 Thread via GitHub


Prasagnya commented on issue #9493:
URL: https://github.com/apache/hudi/issues/9493#issuecomment-1686516725

   Hey @ad1happy2go, is there any way _I can overwrite a specific partition_?
   I came down to this delete use case from my use case of overwriting one 
partition on a given day of the week. Otherwise it would be a simple upsert.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


jonvex commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1300217569


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -237,24 +232,27 @@ private void 
testBaseFileAndLogFileUpdateMatchesHelper(Boolean shouldAsyncCompac
 doWrite(updatedRecord);
 if (shouldRollback) {
   deleteLatestDeltacommit();
+  enableInlineCompaction(shouldInlineCompact);

Review Comment:
   That's how I was originally trying to do it, but the behavior with the 
rollback was strange. If I set `hoodie.compact.inline.max.delta.commits` to 4 
the compaction was not happening, but if I set it to 3 it would happen too 
early. So that's why I set the value to 1 and do it right before the last write.
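
   For reference, a minimal sketch (illustrative, not the test's exact helper) of the write options that make the very next delta commit trigger inline compaction, which is what flipping the config right before the last write achieves:
   
   ```
   import java.util.HashMap;
   import java.util.Map;
   
   public class InlineCompactionOptionsSketch {
     // Options that force compaction to run inline after a single delta commit.
     static Map<String, String> inlineCompactionOptions() {
       Map<String, String> options = new HashMap<>();
       options.put("hoodie.compact.inline", "true");
       options.put("hoodie.compact.inline.max.delta.commits", "1");
       return options;
     }
   }
   ```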



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


jonvex commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1300213317


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -202,10 +246,23 @@ private void 
testBaseFileAndLogFileUpdateMatchesHelper(Boolean shouldAsyncCompac
   assertEquals(0, 
readMatchingRecords().except(batch1.union(updatedRecord)).count());
 }
 
-//Corrupt to prove that colstats does not exclude filegroup
-filesToCorrupt.forEach(TestMORColstats::corruptFile);
-assertEquals(1, filesToCorrupt.size());
-assertThrows(SparkException.class, () -> readMatchingRecords().count());
+if (shouldExecuteCompaction) {
+  doCompaction();
+  filesToCorrupt = getFilesToCorrupt();
+  filesToCorrupt.forEach(TestMORColstats::corruptFile);
+  if (shouldDelete || shouldRollback) {
+//we corrupt both files in the fg
+assertEquals(2, filesToCorrupt.size());

Review Comment:
   It was a lot easier to write the getFilesToCorrupt method that way, because 
of the way "_hoodie_file_name" is populated for log files. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1686459322

   
   ## CI report:
   
   * 1e493605d0a26b442efbf1518b063dbb1e616872 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19390)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19389)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope closed issue #9348: [SUPPORT] hide soft-deleted rows

2023-08-21 Thread via GitHub


codope closed issue #9348: [SUPPORT] hide soft-deleted rows
URL: https://github.com/apache/hudi/issues/9348


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9351: [SUPPORT] The point query performance after clustering is lags behind Delta Lake.

2023-08-21 Thread via GitHub


ad1happy2go commented on issue #9351:
URL: https://github.com/apache/hudi/issues/9351#issuecomment-1686456616

   @zbbkeepgoing Ideally Delta and Hudi should be scanning a similar number of 
files if both are skipping files due to column stats. Can you confirm whether 
Hudi is reading all the files under the time partition column you are using in 
the query? It may happen that Hudi is not skipping files using column stats at 
all and is just reading all files from one partition. 
   
   We can also sync up on a huddle on the Hudi community Slack in case you want 
to look at it together. Ping me (Aditya Goenka).
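
   As a quick reference (an illustrative sketch, not taken from this issue): file pruning via column stats needs the index built at write time and data skipping enabled at read time.
   
   ```
   import java.util.HashMap;
   import java.util.Map;
   
   public class DataSkippingOptionsSketch {
     // Write side: build the column stats index in the metadata table.
     static Map<String, String> writeOptions() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.metadata.enable", "true");
       opts.put("hoodie.metadata.index.column.stats.enable", "true");
       return opts;
     }
   
     // Read side: turn on data skipping so the column stats index is used for pruning.
     static Map<String, String> readOptions() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.metadata.enable", "true");
       opts.put("hoodie.enable.data.skipping", "true");
       return opts;
     }
   }
   ```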


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9348: [SUPPORT] hide soft-deleted rows

2023-08-21 Thread via GitHub


ad1happy2go commented on issue #9348:
URL: https://github.com/apache/hudi/issues/9348#issuecomment-1686457776

   @ys8 Closing out this issue. Please let us know or reopen in case of any 
concerns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-21 Thread via GitHub


jonvex commented on code in PR #9422:
URL: https://github.com/apache/hudi/pull/9422#discussion_r1300201256


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:
##
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieSparkClientTestBase;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.common.testutils.RawTripTestPayload.recordToString;
+import static 
org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS;
+import static org.apache.spark.sql.SaveMode.Append;
+import static org.apache.spark.sql.SaveMode.Overwrite;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+/**
+ * Test mor with colstats enabled in scenarios to ensure that files
+ * are being appropriately read or not read.
+ * The strategy employed is to corrupt targeted base files. If we want
+ * to prove the file is read, we assert that an exception will be thrown.
+ * If we want to prove the file is not read, we expect the read to
+ * successfully execute.
+ */
+public class TestMORColstats extends HoodieSparkClientTestBase {
+
+  private static String matchCond = "trip_type = 'UBERX'";
+  private static String nonMatchCond = "trip_type = 'BLACK'";
+  private static String[] dropColumns = {"_hoodie_commit_time", 
"_hoodie_commit_seqno",
+  "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};
+
+  private Boolean shouldOverwrite;
+  Map options;
+  @TempDir
+  public java.nio.file.Path basePath;
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initSparkContexts();
+dataGen = new HoodieTestDataGenerator();
+shouldOverwrite = true;
+options = getOptions();
+Properties props = new Properties();
+props.putAll(options);
+try {
+  metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, 
basePath.toString(), props);
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+cleanupSparkContexts();
+cleanupTestDataGenerator();
+metaClient = null;
+  }
+
+  /**
+   * Create two files, one should be excluded by colstats
+   */
+  @Test
+  public void testBaseFileOnly() {
+Dataset inserts = makeInsertDf("000", 100);
+Dataset batch1 = inserts.where(matchCond);
+Dataset batch2 = inserts.where(nonMatchCond);
+doWrite(batch1);
+doWrite(batch2);
+List filesToCorrupt = getFilesToCorrupt();
+assertEquals(1, filesToCorrupt.size());
+filesToCorrupt.forEach(TestMORColstats::corruptFile);
+assertEquals(0, readMatchingRecords().except(batch1).count());
+//Read without data skipping to show that it will fail
+//Reading with data skipping succeeded so that means that data skipping is 
working and the corrupted
+//file was 

[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.

2023-08-21 Thread via GitHub


hudi-bot commented on PR #9472:
URL: https://github.com/apache/hudi/pull/9472#issuecomment-1686443773

   
   ## CI report:
   
   * c0019c0fc1d1803b9e0ccfbd1c9de953d6aba4f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19379)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19383)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19388)
 
   * 1e493605d0a26b442efbf1518b063dbb1e616872 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19389)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19390)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9354: [SUPPORT] HoodieDeltaStreamer fails to load org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning

2023-08-21 Thread via GitHub


ad1happy2go commented on issue #9354:
URL: https://github.com/apache/hudi/issues/9354#issuecomment-1686428832

   @andreacfm Sorry for the delay in response here. You should use JDK 1.8 to 
compile the code. It looks like you are using a later version of Java.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9493: Hudi DELETE_PARTITIONS operation doesn't delete partitions via Spark Data Frame

2023-08-21 Thread via GitHub


ad1happy2go commented on issue #9493:
URL: https://github.com/apache/hudi/issues/9493#issuecomment-1686418063

   @Prasagnya What are the cleaner configurations you are using? If the default 
setting is in use, data cleaning will occur once there are 10 or more 
subsequent commits.
   
   It's possible to employ multiple cleaner configurations, outlined in detail 
here: https://hudi.apache.org/docs/hoodie_cleaner
   
   Alternatively, you can manually run HoodieCleaner, specifying a retention of 
just 1 commit, along with custom configurations, if you wish to initiate data 
cleaning. However, please be aware that this action may result in the loss of 
historical data. Let us know in case you have further doubts.
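
   For illustration (a sketch, not a recommendation for every table), cleaner settings that retain only a single commit look like this; with these in place older file versions are removed much sooner than with the default of 10 retained commits:
   
   ```
   import java.util.HashMap;
   import java.util.Map;
   
   public class CleanerOptionsSketch {
     // Aggressive cleaning: keep only the latest commit's file versions.
     // Be aware this removes the history needed for time travel/incremental reads.
     static Map<String, String> aggressiveCleaning() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.clean.automatic", "true");
       opts.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
       opts.put("hoodie.cleaner.commits.retained", "1");
       return opts;
     }
   }
   ```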


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


