[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666742342

   
   ## CI report:
   
   * 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19122)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync (#9374)

2023-08-05 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new cecd79e947b [HUDI-6652] Implement basePath-level synchronization in 
runHoodieMetaSync (#9374)
cecd79e947b is described below

commit cecd79e947b3a274a26e72a1f9c730323393f8f9
Author: Sagar Sumit 
AuthorDate: Sun Aug 6 11:34:46 2023 +0530

[HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync 
(#9374)
---
 .../hudi/sync/common/util/SyncUtilHelpers.java | 65 +++---
 .../hudi/sync/common/util/TestSyncUtilHelpers.java | 64 +
 2 files changed, 108 insertions(+), 21 deletions(-)

diff --git 
a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
 
b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
index 466c0c71ffa..440baa86c0c 100644
--- 
a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
+++ 
b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
@@ -23,7 +23,6 @@ import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieMetaSyncException;
-import org.apache.hudi.sync.common.HoodieSyncConfig;
 import org.apache.hudi.sync.common.HoodieSyncTool;
 
 import org.apache.hadoop.conf.Configuration;
@@ -33,36 +32,56 @@ import org.slf4j.LoggerFactory;
 
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
+
+  // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage.
+  // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage.
-   * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName   Class name of the {@link HoodieSyncTool} implementation.
-   * @param props               property map.
-   * @param hadoopConfig        Hadoop confs.
-   * @param fs                  Filesystem used.
-   * @param targetBasePath      The target base path that contains the hoodie table.
-   * @param baseFileFormat      The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());
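
The locking pattern in this change is simple to see in isolation: a lock-per-base-path map replaces a single class-wide `synchronized`, so syncs to different tables no longer serialize while syncs to the same table still do. A standalone Java sketch of the same pattern (class and method names here are illustrative, not from the commit):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PerPathLockDemo {
  // One lock per base path; computeIfAbsent creates it atomically on first use.
  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();

  static void runWithTableLock(String basePath, Runnable syncAction) {
    if (basePath == null) {
      throw new IllegalArgumentException("Target base path must not be null");
    }
    Lock tableLock = TABLE_LOCKS.computeIfAbsent(basePath, k -> new ReentrantLock());
    tableLock.lock();
    try {
      // At most one thread per base path runs here; different base paths proceed in parallel.
      syncAction.run();
    } finally {
      tableLock.unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread(() -> runWithTableLock("s3://bucket/tbl", () -> System.out.println("sync 1")));
    Thread t2 = new Thread(() -> runWithTableLock("s3://bucket/tbl", () -> System.out.println("sync 2")));
    t1.start(); t2.start();
    t1.join(); t2.join();
  }
}
```

One trade-off of this design: the lock map only grows, since entries are never removed, which is typically acceptable when the set of base paths per JVM is small.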

[GitHub] [hudi] codope merged pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


codope merged PR #9374:
URL: https://github.com/apache/hudi/pull/9374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285153000


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java:
##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
+import org.apache.hudi.avro.model.IntWrapper;
+import org.apache.hudi.avro.model.StringWrapper;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieCleanStat;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
+import org.apache.hudi.common.model.HoodieDeltaWriteStat;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ExternalFilePathUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieBackedTableMetadata;
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY;
+
+/**
+ * Asserts that tables initialized from file paths created outside Hudi can 
properly be loaded.
+ */
+public class TestExternalPathHandling extends HoodieClientTestBase {
+
+  private static final String FIELD_1 = "field1";
+  private static final String FIELD_2 = "field2";
+  private HoodieWriteConfig writeConfig;
+
+  @ParameterizedTest
+  @MethodSource("getArgs")
+  public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List<String> partitions) throws Exception {
+metaClient = HoodieTableMetaClient.reload(metaClient);
+writeConfig = HoodieWriteConfig.newBuilder()
+
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build())
+.withPath(metaClient.getBasePathV2().toString())
+.withEmbeddedTimelineServerEnabled(false)
+.withMetadataConfig(HoodieMetadataConfig.newBuilder()
+.withMaxNumDeltaCommitsBeforeCompaction(2)
+.withFileSystemBootstrapDisabled(true)
+.enable(true)
+.withMetadataIndexColumnStats(true)
+.withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2)
+.build())
+
.withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 
2).build())
+.withTableServicesEnabled(true)
+.build();
+
+writeClient = getHoodieWriteClient(writeConfig);
+String instantTime1 = 
writeClient.startCommit(HoodieTimeline.R

[GitHub] [hudi] hudi-bot commented on pull request #9376: [HUDI-6654] Add new log block header type to store record positions

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9376:
URL: https://github.com/apache/hudi/pull/9376#issuecomment-1666718268

   
   ## CI report:
   
   * 156b61cd5e76559df94ddeada2d11d43c29e750c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19125)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9221: [HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig

2023-08-05 Thread via GitHub


danny0405 commented on code in PR #9221:
URL: https://github.com/apache/hudi/pull/9221#discussion_r1285152739


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##
@@ -98,8 +98,9 @@ public HiveSyncConfig(Properties props) {
 
   public HiveSyncConfig(Properties props, Configuration hadoopConf) {
     super(props, hadoopConf);
-    HiveConf hiveConf = hadoopConf instanceof HiveConf
-        ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);
+    HiveConf hiveConf = new HiveConf();
+    // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
+    hiveConf.addResource(hadoopConf);

Review Comment:
   > but it's possible that there are other configs/custom configs passed in 
via Spark session,
   
   Is this a classical way people pass around hive options with spark?
   
   > An alternative solution would be always pass hadoopConf to HiveConf 
constructor
   
   Does it introduce too much overhead then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285152730


##
hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix acts as a marker when appended to a file path that the path was 
created by an external system and not a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   gotcha 
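
The suffix is just a string marker on the file name; only the `EXTERNAL_FILE_SUFFIX` constant below comes from the diff, while the two helpers are hypothetical illustrations of how such a marker can be applied and detected:

```java
public class ExternalFileMarkerSketch {
  // Matches the constant in the patch; the helper methods are illustrative only.
  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

  static String markAsExternal(String fileName) {
    return fileName + EXTERNAL_FILE_SUFFIX;
  }

  static boolean isExternallyCreated(String fileName) {
    return fileName.endsWith(EXTERNAL_FILE_SUFFIX);
  }

  public static void main(String[] args) {
    String marked = markAsExternal("part-0001.parquet");
    System.out.println(marked + " external? " + isExternallyCreated(marked));
  }
}
```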



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285152678


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##
@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
           + "thereby reducing the number of blocks to search for keys");
 
+  public static final ConfigProperty<Boolean> DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+      .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+      .defaultValue(false)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   hoodie.metadata.auto.initialize sounds better 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (2b34d3fddbd -> 37dd13642f5)

2023-08-05 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 2b34d3fddbd [HUDI-6646] Add default lock provider for spark offline 
compaction and clustering service (#9365)
 add 37dd13642f5 [HUDI-6386] Fix flakey multiwriter tests (#9369)

No new revisions were added by this update.

Summary of changes:
 .../hudi/client/TestHoodieClientMultiWriter.java   | 43 --
 1 file changed, 39 insertions(+), 4 deletions(-)



[GitHub] [hudi] nsivabalan merged pull request #9369: [HUDI-6386] Fix flakey multiwriter tests

2023-08-05 Thread via GitHub


nsivabalan merged PR #9369:
URL: https://github.com/apache/hudi/pull/9369


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9255: [HUDI-6503] Make TableServiceClient's txnManager consistent with Writ…

2023-08-05 Thread via GitHub


danny0405 commented on PR #9255:
URL: https://github.com/apache/hudi/pull/9255#issuecomment-1666717120

   Let's move this out of 0.14.0 because it's only a code refactoring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9376: [HUDI-6654] Add new log block header type to store record positions

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9376:
URL: https://github.com/apache/hudi/pull/9376#issuecomment-1666716359

   
   ## CI report:
   
   * 156b61cd5e76559df94ddeada2d11d43c29e750c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666714706

   
   ## CI report:
   
   * 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
   * 918c99b3221af6f081abd37c9d9494767aab0906 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19123)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


vinothchandar commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285149102


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##
@@ -440,6 +447,10 @@ public int getRecordIndexMaxParallelism() {
 return getInt(RECORD_INDEX_MAX_PARALLELISM);
   }
 
+  public boolean isFileSystemBootstrapDisabled() {

Review Comment:
   to be renamed along with config name change. 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -616,6 +616,9 @@ private HoodieTableMetaClient initializeMetaClient() throws IOException {
    * @return List consisting of {@code DirectoryInfo} for each partition found.
    */
   private List<DirectoryInfo> listAllPartitionsFromFilesystem(String initializationTime) {
+    if (dataWriteConfig.getMetadataConfig().isFileSystemBootstrapDisabled()) {

Review Comment:
   food for thought : I think this should prevent any MDT partitions from being 
initialized, not just `files`



##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##
@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
           + "thereby reducing the number of blocks to search for keys");
 
+  public static final ConfigProperty<Boolean> DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+      .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+      .defaultValue(false)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   We can mark this advanced. (@yihua to confirm) . 
   
   On naming : just `hoodie.metadata.bootstrap.disabled` or 
`hoodie.metadata.auto.initialize` = true (default)| false?
   Thats the best I can think of.  
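
Whatever the final key name, the config is wired through the `HoodieMetadataConfig` builder; a minimal sketch mirroring the usage in `TestExternalPathHandling` earlier in this thread (builder and accessor names may still change per the naming discussion above):

```java
import org.apache.hudi.common.config.HoodieMetadataConfig;

public class MetadataBootstrapConfigSketch {
  public static void main(String[] args) {
    HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
        .enable(true)
        .withFileSystemBootstrapDisabled(true) // skip listing the file system on first initialization
        .build();
    // The metadata writer consults this flag before listing all partitions from storage.
    System.out.println(metadataConfig.isFileSystemBootstrapDisabled());
  }
}
```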



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions

2023-08-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6654:
-
Labels: pull-request-available  (was: )

> Add new log block header type to store record positions
> ---
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
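
Conceptually, the change adds one more entry to the log block header map; a self-contained sketch of the idea (the enum here stands in for Hudi's `HoodieLogBlock.HeaderMetadataType`, with the new entry named after the PR title and the string encoding purely illustrative):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RecordPositionHeaderSketch {
  enum HeaderMetadataType { INSTANT_TIME, SCHEMA, RECORD_POSITIONS }

  public static void main(String[] args) {
    List<Long> positions = List.of(0L, 3L, 7L); // row positions of affected records in the base file
    Map<HeaderMetadataType, String> header = new EnumMap<>(HeaderMetadataType.class);
    header.put(HeaderMetadataType.INSTANT_TIME, "20230805123456789");
    header.put(HeaderMetadataType.RECORD_POSITIONS,
        positions.stream().map(String::valueOf).collect(Collectors.joining(",")));
    // A reader can merge by these positions without deserializing record keys.
    System.out.println(header.get(HeaderMetadataType.RECORD_POSITIONS));
  }
}
```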


[GitHub] [hudi] yihua opened a new pull request, #9376: [HUDI-6654] Add new log block header type to store record positions

2023-08-05 Thread via GitHub


yihua opened a new pull request, #9376:
URL: https://github.com/apache/hudi/pull/9376

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Summary: Add new log block header type to store record positions  (was: Add 
new log blocker header type to store record positions)

> Add new log block header type to store record positions
> ---
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6654) Add new log blocker header type to store record positions

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Summary: Add new log blocker header type to store record positions  (was: 
Encode record positions in log blocks)

> Add new log blocker header type to store record positions
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Description: To support position-based merging of base and log files, we 
need to encode positions in the log blocks so that the positions can be used 
directly, without having to deserialize records or delete keys for 
OverwriteWithLatest payload, or with ordering values required only for 
`DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
new `HeaderMetadataType` to store the positions in the log block header.  (was: 
To support position-based merging of base and log files, we need to encode 
positions in the log blocks so that the positions can be used directly, without 
having to deserialize records or delete keys for OverwriteWithLatest payload, 
or with ordering values required only for `DefaultHoodieRecordPayload` 
supporting event time based streaming.)

> Encode record positions in log blocks
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666702367

   
   ## CI report:
   
   * 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
   * 918c99b3221af6f081abd37c9d9494767aab0906 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666700777

   
   ## CI report:
   
   * 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666700762

   
   ## CI report:
   
   * ee5d3013bf73645648b344a468404dd076e80003 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666700714

   
   ## CI report:
   
   * 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19122)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] cxzl25 commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning

2023-08-05 Thread via GitHub


cxzl25 commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666700500

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6646) Add default lock provider for spark offline compaction and clustering service

2023-08-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6646:
-
Fix Version/s: 0.14.0
   (was: 1.0.0)

> Add default lock provider for spark offline compaction and clustering service
> -
>
> Key: HUDI-6646
> URL: https://issues.apache.org/jira/browse/HUDI-6646
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6646) Add default lock provider for spark offline compaction and clustering service

2023-08-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6646.

Resolution: Fixed

Fixed via master branch: 2b34d3fddbd757a9ff77dd0ad2c67f0d04b03e4f

> Add default lock provider for spark offline compaction and clustering service
> -
>
> Key: HUDI-6646
> URL: https://issues.apache.org/jira/browse/HUDI-6646
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6646] Add default lock provider for spark offline compaction and clustering service (#9365)

2023-08-05 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2b34d3fddbd [HUDI-6646] Add default lock provider for spark offline 
compaction and clustering service (#9365)
2b34d3fddbd is described below

commit 2b34d3fddbd757a9ff77dd0ad2c67f0d04b03e4f
Author: Danny Chan 
AuthorDate: Sun Aug 6 11:43:03 2023 +0800

[HUDI-6646] Add default lock provider for spark offline compaction and 
clustering service (#9365)

The default in progress lock provider does not work well for multi-engine interactions.
Configure a file system lock provider which shares the same lock path for engines.
---
 .../lock/FileSystemBasedLockProvider.java  | 25 
 .../org/apache/hudi/util/FlinkWriteClients.java|  6 +--
 .../java/org/apache/hudi/util/StreamerUtil.java|  8 
 .../scala/org/apache/hudi/HoodieCLIUtils.scala |  9 -
 .../procedures/RunClusteringProcedure.scala|  9 -
 .../procedures/RunCompactionProcedure.scala|  8 
 .../apache/hudi/utilities/HoodieClusteringJob.java |  4 ++
 .../org/apache/hudi/utilities/HoodieCompactor.java |  4 ++
 .../org/apache/hudi/utilities/UtilHelpers.java |  8 
 .../org/apache/hudi/utilities/TestUtilHelpers.java | 44 ++
 10 files changed, 110 insertions(+), 15 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
index dc0f16ecb40..da7e71a2058 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
@@ -21,6 +21,7 @@ package org.apache.hudi.client.transaction.lock;
 
 import org.apache.hudi.common.config.HoodieCommonConfig;
 import org.apache.hudi.common.config.LockConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.fs.StorageSchemes;
 import org.apache.hudi.common.lock.LockProvider;
@@ -29,6 +30,7 @@ import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.config.HoodieLockConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieLockException;
@@ -50,6 +52,7 @@ import java.util.concurrent.TimeUnit;
 
 import static 
org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_EXPIRE_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY;
+import static 
org.apache.hudi.common.table.HoodieTableMetaClient.AUXILIARYFOLDER_NAME;
 
 /**
  * A FileSystem based lock. This {@link LockProvider} implementation allows to lock table operations
@@ -201,4 +204,26 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable
         || config.getConfig().getString(HoodieWriteConfig.BASE_PATH.key(), null) != null);
     ValidationUtils.checkArgument(config.getConfig().getInteger(FILESYSTEM_LOCK_EXPIRE_PROP_KEY) >= 0);
   }
+
+  /**
+   * Returns a filesystem based lock config with given table path.
+   */
+  public static TypedProperties getLockConfig(String tablePath) {
+    TypedProperties props = new TypedProperties();
+    props.put(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key(), FileSystemBasedLockProvider.class.getName());
+    props.put(HoodieLockConfig.LOCK_ACQUIRE_WAIT_TIMEOUT_MS.key(), "2000");
+    props.put(HoodieLockConfig.FILESYSTEM_LOCK_EXPIRE.key(), "1");
+    props.put(HoodieLockConfig.LOCK_ACQUIRE_CLIENT_NUM_RETRIES.key(), "30");
+    props.put(HoodieLockConfig.FILESYSTEM_LOCK_PATH.key(), defaultLockPath(tablePath));
+    return props;
+  }
+
+  /**
+   * Returns the default lock file root path.
+   *
+   * IMPORTANT: this path should be shared especially when there is engine cooperation.
+   */
+  private static String defaultLockPath(String tablePath) {
+    return tablePath + Path.SEPARATOR + AUXILIARYFOLDER_NAME;
+  }
 }
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java
index 3e84e8493e3..6d8b0d0a7d6 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java
@@ -227,12 +227,8 @@
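
The new `getLockConfig` helper gives callers (the Spark procedures and utilities touched above) a ready-made filesystem-lock configuration for a table path. A minimal usage sketch, with an illustrative table path:

```java
import org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider;
import org.apache.hudi.common.config.TypedProperties;

public class DefaultLockConfigSketch {
  public static void main(String[] args) {
    String tablePath = "file:///tmp/hudi_trips";
    // Per the diff above: lock provider class, wait timeout, expiry, retries,
    // and a lock path under the table's auxiliary folder shared across engines.
    TypedProperties lockProps = FileSystemBasedLockProvider.getLockConfig(tablePath);
    lockProps.forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```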

[GitHub] [hudi] danny0405 merged pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…

2023-08-05 Thread via GitHub


danny0405 merged PR #9365:
URL: https://github.com/apache/hudi/pull/9365


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…

2023-08-05 Thread via GitHub


danny0405 commented on PR #9365:
URL: https://github.com/apache/hudi/pull/9365#issuecomment-199059

   Tests have passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19109&view=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-198867

   
   ## CI report:
   
   * ee5d3013bf73645648b344a468404dd076e80003 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6655) Fix TestWriteMergeOnRead#testConsistentBucketIndex

2023-08-05 Thread Danny Chen (Jira)
Danny Chen created HUDI-6655:


 Summary: Fix TestWriteMergeOnRead#testConsistentBucketIndex
 Key: HUDI-6655
 URL: https://issues.apache.org/jira/browse/HUDI-6655
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Danny Chen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table (#9261)

2023-08-05 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7061652e1bd [HUDI-6579] Adding support for upsert and deletes with 
spark datasource for pk less table (#9261)
7061652e1bd is described below

commit 7061652e1bda92037ae796edae3d18fb6ec64529
Author: Sivabalan Narayanan 
AuthorDate: Sat Aug 5 23:22:46 2023 -0400

[HUDI-6579] Adding support for upsert and deletes with spark datasource for 
pk less table (#9261)

Adding support for upsert and deletes with spark datasource for pk less 
table.
---
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   2 +-
 .../apache/hudi/AutoRecordKeyGenerationUtils.scala |   6 +-
 .../main/scala/org/apache/hudi/DefaultSource.scala |  12 +--
 .../org/apache/hudi/HoodieCreateRecordUtils.scala  | 113 -
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  57 ++-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  34 ++-
 .../TestAutoGenerationOfRecordKeys.scala   |  82 ++-
 7 files changed, 175 insertions(+), 131 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
index ad87b5287ca..2078896987d 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
@@ -114,7 +114,7 @@ public class HoodieMetadataWriteUtils {
 // we will trigger archive manually, to ensure only regular writer 
invokes it
 .withArchivalConfig(HoodieArchivalConfig.newBuilder()
 .archiveCommitsWith(
-            writeConfig.getMinCommitsToKeep(), writeConfig.getMaxCommitsToKeep())
+            writeConfig.getMinCommitsToKeep() + 1, writeConfig.getMaxCommitsToKeep() + 1)
 .withAutoArchive(false)
 .build())
 // we will trigger compaction manually, to control the instant times
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
index ca679acc799..501c563a989 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
@@ -32,7 +32,7 @@ object AutoRecordKeyGenerationUtils {
   private val log = LoggerFactory.getLogger(getClass)
 
   def mayBeValidateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, String], hoodieConfig: HoodieConfig): Unit = {
-    val autoGenerateRecordKeys = !parameters.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) // if record key is not configured,
+    val autoGenerateRecordKeys = isAutoGenerateRecordKeys(parameters)
     // hudi will auto generate.
 if (autoGenerateRecordKeys) {
   // de-dup is not supported with auto generation of record keys
@@ -54,4 +54,8 @@ object AutoRecordKeyGenerationUtils {
   log.warn("Precombine field " + 
hoodieConfig.getString(PRECOMBINE_FIELD.key()) + " will be ignored with auto 
record key generation enabled")
 }
   }
+
+  def isAutoGenerateRecordKeys(parameters: Map[String, String]): Boolean = {
+!parameters.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) // if 
record key is not configured,
+  }
 }
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
index 0f159885d89..3e5cf351ba1 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
@@ -19,7 +19,7 @@ package org.apache.hudi
 
 import org.apache.hadoop.fs.Path
 import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, SPARK_SQL_WRITES_PREPPED_KEY, OPERATION, STREAMING_CHECKPOINT_IDENTIFIER}
+import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, RECORDKEY_FIELD, SPARK_SQL_WRITES_PREPPED_KEY, STREAMING_CHECKPOINT_IDENTIFIER}
 import org.apache.hudi.cdc.CDCRelation
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_READ}
@@ -29,7 +29,7 @@ import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.util.ConfigUtils
 import org.apache.hudi.
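
From the user side, the feature means an upsert can be issued through the Spark datasource without configuring a record key field. A hedged sketch of such a write (table name, path, and session settings are illustrative; the usual Hudi Spark session configs are omitted):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PkLessUpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[2]").appName("pk-less").getOrCreate();
    Dataset<Row> df = spark.sql("select 1 as id, 'a' as name, '2023-08-05' as ts");

    // No hoodie.datasource.write.recordkey.field is set, so Hudi auto-generates
    // record keys; with this change, upsert (and delete) operations are allowed too.
    df.write().format("hudi")
        .option("hoodie.table.name", "pk_less_demo")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save("/tmp/pk_less_demo");
    spark.stop();
  }
}
```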

[GitHub] [hudi] nsivabalan merged pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-05 Thread via GitHub


nsivabalan merged PR #9261:
URL: https://github.com/apache/hudi/pull/9261


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 closed pull request #9340: [HUDI-6630] Automatic release connection for hoodie metaserver client

2023-08-05 Thread via GitHub


eric9204 closed pull request #9340: [HUDI-6630] Automatic release connection 
for hoodie metaserver client
URL: https://github.com/apache/hudi/pull/9340


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 opened a new pull request, #9375: [MINOR] Disable the MDT for flink by default

2023-08-05 Thread via GitHub


danny0405 opened a new pull request, #9375:
URL: https://github.com/apache/hudi/pull/9375

   ### Change Logs
   
   Disable the MDT by default because there are some cross-engine issues that are still unresolved.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 commented on pull request #9340: [HUDI-6630] Automatic release connection for hoodie metaserver client

2023-08-05 Thread via GitHub


eric9204 commented on PR #9340:
URL: https://github.com/apache/hudi/pull/9340#issuecomment-191795

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 closed pull request #7875: [MINOR] No need to close the metadata writer twice

2023-08-05 Thread via GitHub


danny0405 closed pull request #7875: [MINOR] No need to close the metadata 
writer twice
URL: https://github.com/apache/hudi/pull/7875


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7875: [MINOR] No need to close the metadata writer twice

2023-08-05 Thread via GitHub


hudi-bot commented on PR #7875:
URL: https://github.com/apache/hudi/pull/7875#issuecomment-188590

   
   ## CI report:
   
   * f47bd395c4f4569fa19cb99cb4882078ec683aca Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14989)
   * 4f79281e0bf502a648f48e1fee142cff03ccfe71 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


danny0405 commented on code in PR #9374:
URL: https://github.com/apache/hudi/pull/9374#discussion_r1285145028


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java:
##
@@ -33,36 +32,56 @@
 
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
+
+  // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage.
+  // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage.
-   * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName   Class name of the {@link HoodieSyncTool} implementation.
-   * @param props               property map.
-   * @param hadoopConfig        Hadoop confs.
-   * @param fs                  Filesystem used.
-   * @param targetBasePath      The target base path that contains the hoodie table.
-   * @param baseFileFormat      The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());

Review Comment:
   yeah, we can always extend it with more sophisticated requests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9280: [HUDI-6587] Check incomplete commit for time travel query

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9280:
URL: https://github.com/apache/hudi/pull/9280#discussion_r1285144748


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java:
##
@@ -315,6 +318,29 @@ public static Option<HoodieInstant> getEarliestInstantForMetadataArchival(
     }
   }
 
+  /**
+   * Validate user-specified timestamp of time travel query against incomplete commit's timestamp.
+   *
+   * @throws HoodieException when time travel query's timestamp >= incomplete commit's timestamp
+   */
+  public static void validateTimestampAsOf(HoodieTableMetaClient metaClient, String timestampAsOf) {

Review Comment:
   Are we not supporting the 3 enum values? It looks like we are going with FAIL directly?
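   
   For reference, a minimal sketch of a FAIL-style check consistent with the javadoc above (illustrative only; the exact timeline filtering and message in the patch may differ):
   
   ```java
   public static void validateTimestampAsOf(HoodieTableMetaClient metaClient, String timestampAsOf) {
     // First incomplete (requested/inflight) commit on the timeline, if any.
     Option<HoodieInstant> firstIncomplete = metaClient.getCommitsTimeline()
         .filterInflightsAndRequested()
         .firstInstant();
     if (firstIncomplete.isPresent()
         && HoodieTimeline.compareTimestamps(timestampAsOf, HoodieTimeline.GREATER_THAN_OR_EQUALS,
             firstIncomplete.get().getTimestamp())) {
       // FAIL behavior: reject the query rather than silently reading past an incomplete commit.
       throw new HoodieException(String.format(
           "Time travel timestamp '%s' must be earlier than the first incomplete commit '%s'",
           timestampAsOf, firstIncomplete.get().getTimestamp()));
     }
   }
   ```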



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9367:
URL: https://github.com/apache/hudi/pull/9367#issuecomment-185745

   
   ## CI report:
   
   * cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19118)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #9071: [HUDI-6453] Cascade Glue schema changes to partitions

2023-08-05 Thread via GitHub


nsivabalan commented on PR #9071:
URL: https://github.com/apache/hudi/pull/9071#issuecomment-182663

   hey @CTTY @parisni: We are looking to cut the branch for 0.14.0 in 1 day. If you can coordinate and land the patch, that would be good and we can pull it in. CC @prashantwason 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9371:
URL: https://github.com/apache/hudi/pull/9371#issuecomment-175249

   
   ## CI report:
   
   * 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN
   * 5bc292aa241347e9a985e70e1335a9d7309d51aa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19117)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


codope commented on code in PR #9374:
URL: https://github.com/apache/hudi/pull/9374#discussion_r1285138372


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java:
##
@@ -33,36 +32,56 @@
 
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
+
+  // Locks for each table (base path) to avoid concurrent modification of the same underlying meta storage.
+  // A meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class-level thread safe to avoid concurrent modification of the same underlying meta storage.
-   * A meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName   Class name of the {@link HoodieSyncTool} implementation.
-   * @param props               property map.
-   * @param hadoopConfig        Hadoop confs.
-   * @param fs                  Filesystem used.
-   * @param targetBasePath      The target base path that contains the hoodie table.
-   * @param baseFileFormat      The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());

Review Comment:
   I noticed that for the same table, different sync tools run one after the other, so I did not see the need to concat the tool class to the key. In the future, if we plan to run everything concurrently, we can add the tool class to the key.
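   
   If we ever do, a minimal sketch of that composite-key variant (the key shape and delimiter are assumptions, not part of this patch):
   
   ```java
   // Lock per (base path, sync tool) so different tools for the same table can run concurrently.
   Lock tableLock = TABLE_LOCKS.computeIfAbsent(
       targetBasePath + "|" + syncToolClassName, // composite key; any delimiter that cannot appear in a path works
       k -> new ReentrantLock());
   ```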



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6592) Flink insert overwrite should support dynamic partition instead of whole table

2023-08-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6592:
-
Fix Version/s: 0.14.0

> Flink insert overwrite should support dynamic partition instead of whole table
> --
>
> Key: HUDI-6592
> URL: https://issues.apache.org/jira/browse/HUDI-6592
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Flink insert overwrite should support dynamic partition instead of the whole 
> table, which behavior is consistent with the semantics of insert overwrite in 
> Flink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6592) Flink insert overwrite should support dynamic partition instead of whole table

2023-08-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6592.

Resolution: Fixed

Fixed via master branch: 8c6e8d92cad465925d5bf165deccc18229efcd1f

> Flink insert overwrite should support dynamic partition instead of whole table
> --
>
> Key: HUDI-6592
> URL: https://issues.apache.org/jira/browse/HUDI-6592
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Flink insert overwrite should support dynamic partition instead of the whole 
> table, which behavior is consistent with the semantics of insert overwrite in 
> Flink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table (#9287)

2023-08-05 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c6e8d92cad [HUDI-6592] Flink insert overwrite should support dynamic 
partition and whole table (#9287)
8c6e8d92cad is described below

commit 8c6e8d92cad465925d5bf165deccc18229efcd1f
Author: Nicholas Jiang 
AuthorDate: Sun Aug 6 08:15:19 2023 +0800

[HUDI-6592] Flink insert overwrite should support dynamic partition and 
whole table (#9287)
---
 .../apache/hudi/configuration/FlinkOptions.java| 11 +++
 .../apache/hudi/configuration/OptionsResolver.java |  9 ++
 .../sink/overwrite/PartitionOverwriteMode.java | 35 ++
 .../org/apache/hudi/table/HoodieTableSink.java | 11 ---
 .../apache/hudi/table/ITTestHoodieDataSource.java  | 27 +
 .../test/java/org/apache/hudi/utils/TestData.java  | 16 ++
 6 files changed, 99 insertions(+), 10 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index c140d40af88..556d0b2ef2b 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -39,6 +39,7 @@ import org.apache.hudi.hive.ddl.HiveSyncMode;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
 import org.apache.hudi.keygen.constant.KeyGeneratorType;
+import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
 import org.apache.hudi.util.ClientIds;
 
@@ -613,6 +614,16 @@ public class FlinkOptions extends HoodieConfig {
   .defaultValue(128)
   .withDescription("Sort memory in MB, default 128MB");
 
+  @AdvancedConfig
+  public static final ConfigOption<String> WRITE_PARTITION_OVERWRITE_MODE = ConfigOptions
+  .key("write.partition.overwrite.mode")
+  .stringType()
+  .defaultValue(PartitionOverwriteMode.STATIC.name())
+  .withDescription("When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. "
+  + "Static mode deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. "
+  + "Dynamic mode doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them at runtime. "
+  + "By default we use static mode to keep the same behavior as the previous version.");
+
   // this is only for internal use
   @AdvancedConfig
   public static final ConfigOption<String> WRITE_CLIENT_ID = ConfigOptions
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
index 944e795dc2f..bfde0b0e2b0 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
@@ -35,6 +35,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.format.FilePathUtils;
 
 import org.apache.flink.configuration.ConfigOption;
@@ -241,6 +242,14 @@ public class OptionsResolver {
 || conf.getString(FlinkOptions.OPERATION).equalsIgnoreCase(WriteOperationType.INSERT_OVERWRITE.value());
   }
 
+  /**
+   * Returns whether the operation is INSERT OVERWRITE with dynamic partition semantics.
+   */
+  public static boolean overwriteDynamicPartition(Configuration conf) {
+    return conf.getString(FlinkOptions.OPERATION).equalsIgnoreCase(WriteOperationType.INSERT_OVERWRITE.value())
+        || conf.getString(FlinkOptions.WRITE_PARTITION_OVERWRITE_MODE).equalsIgnoreCase(PartitionOverwriteMode.DYNAMIC.name());
+  }
+
   /**
    * Returns whether the read start commit is a specific commit timestamp.
    */
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java
new file mode 100644
index 000..700c7432e25
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) unde

[GitHub] [hudi] danny0405 merged pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table

2023-08-05 Thread via GitHub


danny0405 merged PR #9287:
URL: https://github.com/apache/hudi/pull/9287


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Description: To support position-based merging of base and log files, we need 
to encode record positions in the log blocks so that the positions can be used 
directly: without having to deserialize records or delete keys for the 
OverwriteWithLatest payload, and with only the ordering values required for 
`DefaultHoodieRecordPayload`, which supports event-time-based streaming.

> Encode record positions in log blocks
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> record positions in the log blocks so that the positions can be used 
> directly: without having to deserialize records or delete keys for the 
> OverwriteWithLatest payload, and with only the ordering values required for 
> `DefaultHoodieRecordPayload`, which supports event-time-based streaming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
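
An illustrative sketch of the idea (not Hudi code): once a log block carries the positions of the base-file rows it touches, a reader can index straight into the base file instead of matching keys:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PositionBasedMergeSketch {
  // baseRows: rows of the base file in position order (0..n-1).
  // updatesByPosition / deletedPositions: decoded from a log block's position metadata.
  static List<String> merge(List<String> baseRows, Map<Long, String> updatesByPosition, Set<Long> deletedPositions) {
    List<String> merged = new ArrayList<>(baseRows.size());
    for (long pos = 0; pos < baseRows.size(); pos++) {
      if (deletedPositions.contains(pos)) {
        continue; // delete by position: no key extraction or record deserialization needed
      }
      merged.add(updatesByPosition.getOrDefault(pos, baseRows.get((int) pos)));
    }
    return merged;
  }
}
```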


[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-142088

   
   ## CI report:
   
   * edd7d00172e57421cc50dcfcc2c90e4b909f67bd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19116)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6654:
---

Assignee: Ethan Guo

> Encode record positions in log blocks
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6653) Support position-based merging of base and log files

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6653:
---

Assignee: Ethan Guo

> Support position-based merging of base and log files
> 
>
> Key: HUDI-6653
> URL: https://issues.apache.org/jira/browse/HUDI-6653
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6654:
---

 Summary: Encode record positions in log blocks
 Key: HUDI-6654
 URL: https://issues.apache.org/jira/browse/HUDI-6654
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Fix Version/s: 0.14.0

> Encode record positions in log blocks
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6654) Encode record positions in log blocks

2023-08-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Priority: Blocker  (was: Major)

> Encode record positions in log blocks
> -
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6653) Support position-based merging of base and log files

2023-08-05 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6653:
---

 Summary: Support position-based merging of base and log files
 Key: HUDI-6653
 URL: https://issues.apache.org/jira/browse/HUDI-6653
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9367:
URL: https://github.com/apache/hudi/pull/9367#issuecomment-132116

   
   ## CI report:
   
   * 3da7a8536e1c4bb4a7450d59b0ad32e9ed048c20 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19101)
 
   * cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19118)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9371:
URL: https://github.com/apache/hudi/pull/9371#issuecomment-132137

   
   ## CI report:
   
   * 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN
   * 33bbbdb4142bf4eeae9696b3f3d1e795e6970ae3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19103)
 
   * 5bc292aa241347e9a985e70e1335a9d7309d51aa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19117)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9371:
URL: https://github.com/apache/hudi/pull/9371#issuecomment-130313

   
   ## CI report:
   
   * 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN
   * 33bbbdb4142bf4eeae9696b3f3d1e795e6970ae3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19103)
 
   * 5bc292aa241347e9a985e70e1335a9d7309d51aa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9367:
URL: https://github.com/apache/hudi/pull/9367#issuecomment-130298

   
   ## CI report:
   
   * 3da7a8536e1c4bb4a7450d59b0ad32e9ed048c20 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19101)
 
   * cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9374:
URL: https://github.com/apache/hudi/pull/9374#discussion_r1285129506


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java:
##
@@ -33,36 +32,56 @@
 
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
+
+  // Locks for each table (base path) to avoid concurrent modification of the same underlying meta storage.
+  // A meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class-level thread safe to avoid concurrent modification of the same underlying meta storage.
-   * A meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName   Class name of the {@link HoodieSyncTool} implementation.
-   * @param props               property map.
-   * @param hadoopConfig        Hadoop confs.
-   * @param fs                  Filesystem used.
-   * @param targetBasePath      The target base path that contains the hoodie table.
-   * @param baseFileFormat      The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());

Review Comment:
   What are your thoughts on keying off of the table base path and the 
syncToolClassName to allow concurrent updates to multiple meta syncs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285129324


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java:
##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
+import org.apache.hudi.avro.model.IntWrapper;
+import org.apache.hudi.avro.model.StringWrapper;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieCleanStat;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
+import org.apache.hudi.common.model.HoodieDeltaWriteStat;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ExternalFilePathUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieBackedTableMetadata;
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY;
+
+/**
+ * Asserts that tables initialized from file paths created outside Hudi can properly be loaded.
+ */
+public class TestExternalPathHandling extends HoodieClientTestBase {
+
+  private static final String FIELD_1 = "field1";
+  private static final String FIELD_2 = "field2";
+  private HoodieWriteConfig writeConfig;
+
+  @ParameterizedTest
+  @MethodSource("getArgs")
+  public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List<String> partitions) throws Exception {
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    writeConfig = HoodieWriteConfig.newBuilder()
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build())
+        .withPath(metaClient.getBasePathV2().toString())
+        .withEmbeddedTimelineServerEnabled(false)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+            .withMaxNumDeltaCommitsBeforeCompaction(2)
+            .withFileSystemBootstrapDisabled(true)
+            .enable(true)
+            .withMetadataIndexColumnStats(true)
+            .withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2)
+            .build())
+        .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build())
+        .withTableServicesEnabled(true)
+        .build();
+
+    writeClient = getHoodieWriteClient(writeConfig);
+    String instantTime1 = writeClient.startCommit(HoodieT

[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285129179


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##
@@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
     super(filePath);
     this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
-    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName(getFileName());
     this.fileId = fileIdAndCommitTime[0];
     this.commitTime = fileIdAndCommitTime[1];
   }
 
+  public HoodieBaseFile(String filePath, String fileId, String commitTime, BaseFile bootstrapBaseFile) {
+    super(filePath);
+    this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+    this.fileId = fileId;
+    this.commitTime = commitTime;
+  }
+
+  private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, BaseFile bootstrapBaseFile) {
+    this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], bootstrapBaseFile);
+  }
+
+  public HoodieBaseFile(FileStatus fileStatus, String fileId, String commitTime, BaseFile bootstrapBaseFile) {
+    super(handleExternallyGeneratedFileName(fileStatus, fileId));

Review Comment:
   Ok, makes sense. Updating the naming and adding a Javadoc for more clarity for future developers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128589


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##
@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
   .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
   + "thereby reducing the number of blocks to search for keys");
 
+  public static final ConfigProperty<Boolean> DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+  .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+  .defaultValue(false)
+  .sinceVersion("0.14.0")
+  .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   Any more thoughts here? I'm not sure about the naming conventions.
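   
   For what it's worth, a minimal usage sketch under the current naming (the builder method is taken from the test in this PR; everything else is illustrative):
   
   ```java
   // Skip the initial file-system listing when the metadata table is first created,
   // e.g. when the table's file metadata will be supplied through commits instead.
   HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
       .enable(true)
       .withFileSystemBootstrapDisabled(true) // hoodie.metadata.filesystem.bootstrap.disabled
       .build();
   ```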



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128491


##
hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix that, when appended to a file path, marks the file as created by an external system rather than a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   For ease of appending this to the path in the write stats that are passed in the commit. Now we go from something like `partition1/file_1.parquet` to `partition1/file_1.parquet_20230805150812_hudiext` instead of `partition1/hudi_ext_20230805150812_file_1.parquet`.
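   
   A sketch of helpers consistent with that naming (the method names are assumptions based on the class quoted above, not necessarily the PR's exact API):
   
   ```java
   public class ExternalFilePathUtil {
     // Marker appended to paths created by an external system rather than a Hudi writer.
     private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";
   
     // partition1/file_1.parquet + 20230805150812 -> partition1/file_1.parquet_20230805150812_hudiext
     public static String appendCommitTimeAndExternalFileMarker(String filePath, String commitTime) {
       return filePath + "_" + commitTime + EXTERNAL_FILE_SUFFIX;
     }
   
     public static boolean isExternallyCreatedFile(String fileName) {
       return fileName.endsWith(EXTERNAL_FILE_SUFFIX);
     }
   }
   ```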



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128263


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java:
##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
+import org.apache.hudi.avro.model.IntWrapper;
+import org.apache.hudi.avro.model.StringWrapper;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieCleanStat;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
+import org.apache.hudi.common.model.HoodieDeltaWriteStat;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ExternalFilePathUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieBackedTableMetadata;
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY;
+
+/**
+ * Asserts that tables initialized from file paths created outside Hudi can properly be loaded.
+ */
+public class TestExternalPathHandling extends HoodieClientTestBase {
+
+  private static final String FIELD_1 = "field1";
+  private static final String FIELD_2 = "field2";
+  private HoodieWriteConfig writeConfig;
+
+  @ParameterizedTest
+  @MethodSource("getArgs")
+  public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List<String> partitions) throws Exception {
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    writeConfig = HoodieWriteConfig.newBuilder()
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build())
+        .withPath(metaClient.getBasePathV2().toString())
+        .withEmbeddedTimelineServerEnabled(false)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+            .withMaxNumDeltaCommitsBeforeCompaction(2)
+            .withFileSystemBootstrapDisabled(true)
+            .enable(true)
+            .withMetadataIndexColumnStats(true)
+            .withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2)
+            .build())
+        .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build())
+        .withTableServicesEnabled(true)
+        .build();
+
+    writeClient = getHoodieWriteClient(writeConfig);
+    String instantTime1 = writeClient.startCommit(HoodieT

[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285128085


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -285,7 +373,29 @@ public HoodieWriteMetadata<O> logCompact(String logCompactionInstantTime) {
    * Commit Log Compaction and track metrics.
    */
   protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable table, String logCompactionCommitTime) {
-    throw new UnsupportedOperationException("Log compaction is not supported yet.");
+    this.context.setJobStatus(this.getClass().getSimpleName(), "Collect log compaction write status and commit compaction");
+    List<HoodieWriteStat> writeStats = metadata.getWriteStats();
+    final HoodieInstant logCompactionInstant = new HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.LOG_COMPACTION_ACTION, logCompactionCommitTime);

Review Comment:
   Added



##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkTableServiceClient.java:
##
@@ -243,7 +169,31 @@ public HoodieWriteMetadata<List<WriteStatus>> cluster(String clusteringInstant,
   }
 
   @Override
-  protected HoodieTable createTable(HoodieWriteConfig config, Configuration hadoopConf) {
+  protected void validateClusteringCommit(HoodieWriteMetadata<List<WriteStatus>> clusteringMetadata, String clusteringCommitTime, HoodieTable table) {

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285128066


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -75,10 +83,13 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit;
 
-public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient implements RunsTableService {
+public abstract class BaseHoodieTableServiceClient<I, T, O> extends BaseHoodieClient implements RunsTableService {

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285127678


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -974,4 +1159,20 @@ public void close() {
 // Stop timeline-server if running
 super.close();
   }
+
+  protected void updateTableMetadata(HoodieTable table, HoodieCommitMetadata commitMetadata,
+                                     HoodieInstant hoodieInstant,
+                                     HoodieData<WriteStatus> writeStatuses) {
+    // Do not do any conflict resolution here as we do with regular writes. We take the lock here to ensure all writes to the metadata table happen within a
+    // single lock (single writer), because more than one write to the metadata table will result in conflicts since all of them update the same partition.
+    table.getMetadataWriter(hoodieInstant.getTimestamp())
+        .ifPresent(writer -> ((HoodieTableMetadataWriter) writer).update(commitMetadata, writeStatuses, hoodieInstant.getTimestamp()));
+  }
+
+  protected void handleWriteErrors(List<HoodieWriteStat> writeStats, TableServiceType tableServiceType) {

Review Comment:
   Ok sounds good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285127206


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -244,12 +305,39 @@ public Option<String> scheduleCompaction(Option<Map<String, String>> extraMetada
    * @param metadata      All the metadata that gets stored along with a commit
    * @param extraMetadata Extra Metadata to be stored
    */
-  public abstract void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option<Map<String, String>> extraMetadata);
+  public void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option<Map<String, String>> extraMetadata) {
+    extraMetadata.ifPresent(m -> m.forEach(metadata::addMetadata));
+    completeCompaction(metadata, createTable(config, context.getHadoopConf().get()), compactionInstantTime);
+  }
 
   /**
    * Commit Compaction and track metrics.
    */
-  protected abstract void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime);
+  protected void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime) {
+    this.context.setJobStatus(this.getClass().getSimpleName(), "Collect compaction write status and commit compaction: " + config.getTableName());
+    List<HoodieWriteStat> writeStats = metadata.getWriteStats();
+    handleWriteErrors(writeStats, TableServiceType.COMPACT);
+    final HoodieInstant compactionInstant = HoodieTimeline.getCompactionInflightInstant(compactionCommitTime);
+    try {
+      this.txnManager.beginTransaction(Option.of(compactionInstant), Option.empty());
+      finalizeWrite(table, compactionCommitTime, writeStats);
+      // commit to data table after committing to metadata table.
+      updateTableMetadata(table, metadata, compactionInstant, context.emptyHoodieData());

Review Comment:
   Do you know why the Spark implementation isn't calling the `writeTableMetadata` that each client is required to implement? The implementations are very similar, so we may be able to standardize.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9345:
URL: https://github.com/apache/hudi/pull/9345#issuecomment-117084

   
   ## CI report:
   
   * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN
   * 5d45f2e15f3c08faba116d207c72d52f1c185321 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19115)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126911


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -174,11 +188,43 @@ private void inlineCompaction(HoodieTable table, Option<Map<String, String>> ext
   /**
    * Ensures compaction instant is in expected state and performs Log Compaction for the workload stored in instant-time.
    *
-   * @param compactionInstantTime Compaction Instant Time
+   * @param logCompactionInstantTime Compaction Instant Time
    * @return Collection of Write Status
    */
-  protected HoodieWriteMetadata<O> logCompact(String compactionInstantTime, boolean shouldComplete) {
-    throw new UnsupportedOperationException("Log compaction is not supported yet.");
+  protected HoodieWriteMetadata<O> logCompact(String logCompactionInstantTime, boolean shouldComplete) {

Review Comment:
   This would be a breaking change in the event of someone else implementing custom logic in a write client outside of the Hudi repo.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126721


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -146,7 +157,10 @@ protected void setPendingInflightAndRequestedInstants(Set<String> pendingInfligh
    * @param metadata commit metadata for which pre commit is being invoked.
    */
   protected void preCommit(HoodieCommitMetadata metadata) {
-    // To be overridden by specific engines to perform conflict resolution if any.
+    // Create a Hoodie table after startTxn which encapsulates the commits and files visible.

Review Comment:
   I think it makes sense to leave it as protected so a future implementation can override it if they want.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


the-other-tim-brown commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126662


##
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/HoodieJavaTableServiceClient.java:
##
@@ -42,27 +48,31 @@ protected HoodieJavaTableServiceClient(HoodieEngineContext context,
   }
 
   @Override
-  protected HoodieWriteMetadata<List<WriteStatus>> compact(String compactionInstantTime, boolean shouldComplete) {
-    throw new HoodieNotSupportedException("Compact is not supported in HoodieJavaTableServiceClient");
+  protected void validateClusteringCommit(HoodieWriteMetadata<List<WriteStatus>> clusteringMetadata, String clusteringCommitTime, HoodieTable table) {

Review Comment:
   Wouldn't validation be the same across clients though?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9371:
URL: https://github.com/apache/hudi/pull/9371#discussion_r1285121130


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -75,10 +83,13 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit;
 
-public abstract class BaseHoodieTableServiceClient<O> extends BaseHoodieClient implements RunsTableService {
+public abstract class BaseHoodieTableServiceClient<I, T, O> extends BaseHoodieClient implements RunsTableService {

Review Comment:
   Can we add Javadocs for the class, also calling out what I, T, and O generally refer to?
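
For instance, the requested class-level doc might look like the sketch below; the descriptions of I, T, and O are assumptions inferred from the engine bindings visible in this PR (e.g. the Java client binding the write output to List<WriteStatus>) and should be confirmed:

/**
 * Base client for scheduling and running table services such as compaction,
 * log compaction and clustering.
 *
 * @param <I> type of the input record collection for writes (engine-specific,
 *            e.g. a List or RDD of HoodieRecord)
 * @param <T> type of the intermediate, transformed record collection
 * @param <O> type of the write output collection (e.g. List<WriteStatus>)
 */
public abstract class BaseHoodieTableServiceClient<I, T, O> extends BaseHoodieClient implements RunsTableService {
  // ...
}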



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -285,7 +373,29 @@ public HoodieWriteMetadata logCompact(String 
logCompactionInstantTime) {
* Commit Log Compaction and track metrics.
*/
   protected void completeLogCompaction(HoodieCommitMetadata metadata, 
HoodieTable table, String logCompactionCommitTime) {
-throw new UnsupportedOperationException("Log compaction is not supported 
yet.");
+this.context.setJobStatus(this.getClass().getSimpleName(), "Collect log 
compaction write status and commit compaction");
+    List<HoodieWriteStat> writeStats = metadata.getWriteStats();
+    final HoodieInstant logCompactionInstant = new HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.LOG_COMPACTION_ACTION, logCompactionCommitTime);

Review Comment:
   Should this also call `handleWriteErrors(writeStats, TableServiceType.LOG_COMPACT);`, mirroring completeCompaction?
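
In other words, the suggestion appears to be mirroring the error handling that completeCompaction (quoted further down in this review) already performs. A sketch of completeLogCompaction with that line added, assuming handleWriteErrors keeps the same signature as in the compaction path:

protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable table, String logCompactionCommitTime) {
  this.context.setJobStatus(this.getClass().getSimpleName(), "Collect log compaction write status and commit compaction");
  List<HoodieWriteStat> writeStats = metadata.getWriteStats();
  // Suggested addition: fail fast on write errors, as completeCompaction does.
  handleWriteErrors(writeStats, TableServiceType.LOG_COMPACT);
  final HoodieInstant logCompactionInstant = new HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.LOG_COMPACTION_ACTION, logCompactionCommitTime);
  // ... begin transaction, finalize write and commit, as in the compaction path
}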



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -146,7 +157,10 @@ protected void 
setPendingInflightAndRequestedInstants(Set pendingInfligh
* @param metadata commit metadata for which pre commit is being invoked.
*/
   protected void preCommit(HoodieCommitMetadata metadata) {
-// To be overridden by specific engines to perform conflict resolution if 
any.
+// Create a Hoodie table after startTxn which encapsulated the commits and 
files visible.

Review Comment:
   Do we even need to keep this protected now that no engine is overriding it?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -244,12 +305,39 @@ public Option 
scheduleCompaction(Option> extraMetada
* @param metadata  All the metadata that gets stored along with 
a commit
* @param extraMetadata Extra Metadata to be stored
*/
-  public abstract void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option<Map<String, String>> extraMetadata);
+  public void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option<Map<String, String>> extraMetadata) {
+    extraMetadata.ifPresent(m -> m.forEach(metadata::addMetadata));
+    completeCompaction(metadata, createTable(config, context.getHadoopConf().get()), compactionInstantTime);
+  }
 
   /**
* Commit Compaction and track metrics.
*/
-  protected abstract void completeCompaction(HoodieCommitMetadata metadata, 
HoodieTable table, String compactionCommitTime);
+  protected void completeCompaction(HoodieCommitMetadata metadata, HoodieTable 
table, String compactionCommitTime) {
+this.context.setJobStatus(this.getClass().getSimpleName(), "Collect 
compaction write status and commit compaction: " + config.getTableName());
+    List<HoodieWriteStat> writeStats = metadata.getWriteStats();
+handleWriteErrors(writeStats, TableServiceType.COMPACT);
+final HoodieInstant compactionInstant = 
HoodieTimeline.getCompactionInflightInstant(compactionCommitTime);
+try {
+  this.txnManager.beginTransaction(Option.of(compactionInstant), 
Option.empty());
+  finalizeWrite(table, compactionCommitTime, writeStats);
+  // commit to data table after committing to metadata table.
+  updateTableMetadata(table, metadata, compactionInstant, 
context.emptyHoodieData());

Review Comment:
   Again, in Flink this was named writeTableMetadata, while in Spark it's updateTableMetadata, and maybe there are reasons for that. Can we ensure we don't change anything for Flink?
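
One way to consolidate without touching Flink call sites would be a thin delegate (a hypothetical sketch; the parameter list is taken from the truncated updateTableMetadata hunk below and may be incomplete):

// Shared implementation under the Spark-era name.
protected void updateTableMetadata(HoodieTable table, HoodieCommitMetadata commitMetadata,
                                   HoodieInstant hoodieInstant, HoodieData<WriteStatus> writeStatuses) {
  // commit to the metadata table before committing to the data table
}

// Retained Flink-era name so existing Flink overrides and call sites keep compiling.
protected void writeTableMetadata(HoodieTable table, HoodieCommitMetadata commitMetadata,
                                  HoodieInstant hoodieInstant, HoodieData<WriteStatus> writeStatuses) {
  updateTableMetadata(table, commitMetadata, hoodieInstant, writeStatuses);
}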
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -974,4 +1159,20 @@ public void close() {
 // Stop timeline-server if running
 super.close();
   }
+
+  protected void updateTableMetadata(HoodieTable table, HoodieCommitMetadata 
commitMetadata,
+   Hood

[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-102755

   
   ## CI report:
   
   * c1977e21cdf02d75158bd3dec3b335e27755915c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19102)
 
   * edd7d00172e57421cc50dcfcc2c90e4b909f67bd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19116)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-101264

   
   ## CI report:
   
   * c1977e21cdf02d75158bd3dec3b335e27755915c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19102)
 
   * edd7d00172e57421cc50dcfcc2c90e4b909f67bd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285120895


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java:
##
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
+import org.apache.hudi.avro.model.IntWrapper;
+import org.apache.hudi.avro.model.StringWrapper;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieCleanStat;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
+import org.apache.hudi.common.model.HoodieDeltaWriteStat;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ExternalFilePathUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieBackedTableMetadata;
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY;
+
+/**
+ * Asserts that tables initialized from file paths created outside Hudi can 
properly be loaded.
+ */
+public class TestExternalPathHandling extends HoodieClientTestBase {
+
+  private static final String FIELD_1 = "field1";
+  private static final String FIELD_2 = "field2";
+  private HoodieWriteConfig writeConfig;
+
+  @ParameterizedTest
+  @MethodSource("getArgs")
+  public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, 
List partitions) throws Exception {
+metaClient = HoodieTableMetaClient.reload(metaClient);
+writeConfig = HoodieWriteConfig.newBuilder()
+
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build())
+.withPath(metaClient.getBasePathV2().toString())
+.withEmbeddedTimelineServerEnabled(false)
+.withMetadataConfig(HoodieMetadataConfig.newBuilder()
+.withMaxNumDeltaCommitsBeforeCompaction(2)
+.withFileSystemBootstrapDisabled(true)
+.enable(true)
+.withMetadataIndexColumnStats(true)
+.withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2)
+.build())
+
.withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 
2).build())
+.withTableServicesEnabled(true)
+.build();
+
+writeClient = getHoodieWriteClient(writeConfig);
+String instantTime1 = 
writeClient.startCommit(HoodieTimeline.R

[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-100068

   
   ## CI report:
   
   * ee5d3013bf73645648b344a468404dd076e80003 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9226:
URL: https://github.com/apache/hudi/pull/9226#issuecomment-1666599963

   
   ## CI report:
   
   * 747f9d37083a87e614e8109bebf4cc33e7368350 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285119922


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -324,33 +325,45 @@ public static HoodieRecord 
createPartitionListRecord(List
* @param partitionThe name of the partition
* @param filesAdded   Mapping of files to their sizes for files which have 
been added to this partition
* @param filesDeleted List of files which have been deleted from this 
partition
+   * @param instantTime  Commit time of the commit responsible for adding 
and/or deleting these files, will be empty during bootstrapping of the metadata 
table
*/
   public static HoodieRecord<HoodieMetadataPayload> createPartitionFilesRecord(String partition,
-                                                                               Option<Map<String, Long>> filesAdded,
-                                                                               Option<List<String>> filesDeleted) {
-    Map<String, HoodieMetadataFileInfo> fileInfo = new HashMap<>();
-filesAdded.ifPresent(filesMap ->
-fileInfo.putAll(
-filesMap.entrySet().stream().collect(
-Collectors.toMap(Map.Entry::getKey, (entry) -> {
-  long fileSize = entry.getValue();
-  // Assert that the file-size of the file being added is 
positive, since Hudi
-  // should not be creating empty files
-  checkState(fileSize > 0);
-  return new HoodieMetadataFileInfo(fileSize, false);
-})))
-);
-filesDeleted.ifPresent(filesList ->
-fileInfo.putAll(
-filesList.stream().collect(
-Collectors.toMap(Function.identity(), (ignored) -> new 
HoodieMetadataFileInfo(0L, true
-);
+                                                                               Map<String, Long> filesAdded,
+                                                                               List<String> filesDeleted,
+                                                                               Option<String> instantTime) {
+    int size = filesAdded.size() + filesDeleted.size();
+    Map<String, HoodieMetadataFileInfo> fileInfo = new HashMap<>(size, 1);
+filesAdded.forEach((fileName, fileSize) -> {
+  // Assert that the file-size of the file being added is positive, since 
Hudi
+  // should not be creating empty files
+  checkState(fileSize > 0);
+  fileInfo.put(handleFileName(fileName, instantTime), new 
HoodieMetadataFileInfo(fileSize, false));
+});
+
+filesDeleted.forEach(fileName -> fileInfo.put(handleFileName(fileName, 
instantTime), DELETE_FILE_METADATA));
 
 HoodieKey key = new HoodieKey(partition, 
MetadataPartitionType.FILES.getPartitionPath());
 HoodieMetadataPayload payload = new 
HoodieMetadataPayload(key.getRecordKey(), METADATA_TYPE_FILE_LIST, fileInfo);
 return new HoodieAvroRecord<>(key, payload);
   }
 
+  /**
+   * In the case where a file was created by something other than a Hudi 
writer, the file name will not contain the commit time. We will prefix the file 
name with hudiext_[commitTime] before storing
+   * in the metadata table. The constructor for {@link 
org.apache.hudi.common.model.HoodieBaseFile} will properly handle this prefix.
+   * @param fileName incoming file name
+   * @param commitTime time of the commit (will be empty during bootstrap 
operations)
+   * @return file name with commit time prefix if the input file name does not 
contain the commit time, otherwise returns the original input
+   */
+  private static String handleFileName(String fileName, Option<String> commitTime) {
+return commitTime.map(commit -> {
+  if (fileName.contains(commit) || FSUtils.isLogFile(fileName)) {

Review Comment:
   sounds good.






[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285119538


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##
@@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
 super(filePath);
 this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
-String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+String[] fileIdAndCommitTime = 
getFileIdAndCommitTimeFromFileName(getFileName());
 this.fileId = fileIdAndCommitTime[0];
 this.commitTime = fileIdAndCommitTime[1];
   }
 
+  public HoodieBaseFile(String filePath, String fileId, String commitTime, 
BaseFile bootstrapBaseFile) {
+super(filePath);
+this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+this.fileId = fileId;
+this.commitTime = commitTime;
+  }
+
+  private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, 
BaseFile bootstrapBaseFile) {
+this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], 
bootstrapBaseFile);
+  }
+
+  public HoodieBaseFile(FileStatus fileStatus, String fileId, String 
commitTime, BaseFile bootstrapBaseFile) {
+super(handleExternallyGeneratedFileName(fileStatus, fileId));

Review Comment:
   The naming kind of threw me off.
   Can we name it `mayBeHandleExternallyGeneratedFiles` instead of `handleExternallyGeneratedFileName`?



##
hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix acts as a marker when appended to a file path that the path was 
created by an external system and not a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   Oh, we are changing it to a suffix, is it? What's the rationale?
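
A self-contained sketch of the marker mechanics (the constant value comes from the diff above; the helper methods are illustrative, not the PR's API). Presumably the rationale is that a trailing marker is unambiguous to detect and strip, since an externally chosen file name can contain arbitrary content before it:

public class ExternalFileMarkerSketch {
  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

  // Tag an externally created file with the commit time that registered it.
  static String markExternal(String fileName, String commitTime) {
    return fileName + "_" + commitTime + EXTERNAL_FILE_SUFFIX;
  }

  // A suffix keeps the check cheap: no need to guess where a commit time would
  // start inside an arbitrary, externally generated file name.
  static boolean isExternallyCreated(String fileName) {
    return fileName.endsWith(EXTERNAL_FILE_SUFFIX);
  }

  public static void main(String[] args) {
    String marked = markExternal("data-00001.parquet", "20230805120000");
    System.out.println(marked + " -> external? " + isExternallyCreated(marked));
  }
}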




[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9345:
URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666588552

   
   ## CI report:
   
   * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106)
 
   * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN
   * 5d45f2e15f3c08faba116d207c72d52f1c185321 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19115)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1666588507

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * f13bb9ce0b2ba2d264ed8cf639a70bd45ee2d578 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9345:
URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666579827

   
   ## CI report:
   
   * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106)
 
   * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN
   * 5d45f2e15f3c08faba116d207c72d52f1c185321 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9345:
URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666578509

   
   ## CI report:
   
   * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106)
 
   * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666578493

   
   ## CI report:
   
   * 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9345:
URL: https://github.com/apache/hudi/pull/9345#discussion_r1285089880


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -106,14 +109,30 @@ case class HoodieFileIndex(spark: SparkSession,
*
* @return List of FileStatus for base files
*/
-  def allFiles: Seq[FileStatus] = {
+  def allBaseFiles: Seq[FileStatus] = {

Review Comment:
   private?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -122,86 +141,132 @@ case class HoodieFileIndex(spark: SparkSession,
* @return list of PartitionDirectory containing partition to base files 
mapping
*/
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: 
Seq[Expression]): Seq[PartitionDirectory] = {
-// Look up candidate files names in the col-stats index, if all of the 
following conditions are true
-//- Data-skipping is enabled
-//- Col-Stats Index is present
-//- List of predicates (filters) is present
-val candidateFilesNamesOpt: Option[Set[String]] =
+// Prune the partition path by the partition filters
+// NOTE: Non-partitioned tables are assumed to consist from a single 
partition
+//   encompassing the whole table
+val partitionsAndFileSlices = 
getFileSlicesForPrunedPartitions(partitionFilters)
+val partitionsAndFilteredFileSlices = filterFileSlices(dataFilters, 
partitionsAndFileSlices).map {
+  case (partitionOpt, fileSlices) =>
+val allCandidateFiles: Seq[FileStatus] = fileSlices.flatMap(fs => {
+  val baseFileStatusOpt = 
getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null)))
+  val logFilesStatus = if (includeLogFiles) {
+
fs.getLogFiles.map[FileStatus](JFunction.toJavaFunction[HoodieLogFile, 
FileStatus](lf => lf.getFileStatus))
+  } else {
+java.util.stream.Stream.empty()
+  }
+  val files = 
logFilesStatus.collect(Collectors.toList[FileStatus]).asScala
+  baseFileStatusOpt.foreach(f => files.append(f))
+  files
+})
+
+PartitionDirectory(InternalRow.fromSeq(partitionOpt.get.values), 
allCandidateFiles)
+}
+
+hasPushedDownPartitionPredicates = true
+
+if (shouldReadAsPartitionedTable()) {
+  partitionsAndFilteredFileSlices
+} else {
+  Seq(PartitionDirectory(InternalRow.empty, 
partitionsAndFilteredFileSlices.flatMap(_.files)))
+}
+  }
+
+  def filterFileSlices(dataFilters: Seq[Expression], partitionAndFileSlices: 
Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])])
+  : Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])] = {
+// If there are no data filters, return all the file slices.
+// If there are no file slices, return empty list.
+if (partitionAndFileSlices.isEmpty || dataFilters.isEmpty) {
+  partitionAndFileSlices
+} else {
+  // Look up candidate files names in the col-stats index, if all of the 
following conditions are true
+  //- Data-skipping is enabled
+  //- Col-Stats Index is present
+  //- List of predicates (filters) is present
+  val candidateFilesNamesOpt: Option[Set[String]] =

Review Comment:
   Let's file a follow-up ticket.
   It looks like we are not leveraging the pruned partition filters while polling col stats.
   Anyway, after we get the list from col stats, we compare it against allFiles and trim it down properly, so it's not too bad.
   But the amount of data fetched from col stats could be trimmed.
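
The trimming described here is essentially a set intersection; a minimal sketch of the shape of the computation (names are hypothetical, not the actual HoodieFileIndex code):

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DataSkippingSketch {
  // The col-stats lookup currently yields candidate file names table-wide; they
  // are then intersected with the files of the partitions that survived pruning.
  // Fetching col-stats rows only for the pruned partitions would shrink the
  // candidate set up front instead.
  static List<String> filterFiles(List<String> prunedPartitionFiles, Set<String> colStatsCandidates) {
    return prunedPartitionFiles.stream()
        .filter(colStatsCandidates::contains)
        .collect(Collectors.toList());
  }
}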



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala:
##
@@ -69,6 +71,8 @@ abstract class BaseMergeOnReadSnapshotRelation(sqlContext: 
SQLContext,
prunedDataSchema: 
Option[StructType])
   extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema, 
prunedDataSchema) {
 
+  fileIndex.setIncludeLogFiles(true)

Review Comment:
   I feel this is not elegant: the setter is left open, and any caller can invoke it at any time.
   Can we introduce a protected method called shouldIncludeLogFiles() on HoodieBaseRelation, defaulting to false?
   
   We can then override it in MergeOnReadSnapshotRelation or BaseMergeOnReadSnapshotRelation to return true.
   
   We can also change the constructor of HoodieFileIndex to take an additional arg called shouldIncludeLogFiles, which callers are expected to set when invoking the constructor; see the sketch below.
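
A self-contained Java analogue of the suggested shape (the actual relation and file-index classes are Scala; the names below just mirror the ones in this comment):

abstract class BaseRelationSketch {
  // Default: only base files are listed.
  protected boolean shouldIncludeLogFiles() {
    return false;
  }

  final FileIndexSketch buildFileIndex() {
    // The flag is fixed at construction time instead of via an open setter.
    return new FileIndexSketch(shouldIncludeLogFiles());
  }
}

class MergeOnReadSnapshotRelationSketch extends BaseRelationSketch {
  @Override
  protected boolean shouldIncludeLogFiles() {
    return true;
  }
}

class FileIndexSketch {
  private final boolean includeLogFiles;

  FileIndexSketch(boolean includeLogFiles) {
    this.includeLogFiles = includeLogFiles;
  }
}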
   
   



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -122,86 +141,132 @@ case class HoodieFileIndex(spark: SparkSession,
* @return list of PartitionDirectory containing partition to base files 
mapping
*/
   override def listFiles(partitionFilters: Seq[Expression

[GitHub] [hudi] hudi-bot commented on pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9365:
URL: https://github.com/apache/hudi/pull/9365#issuecomment-1666577154

   
   ## CI report:
   
   * 320ff70e6f0d02b427d9dc868628934f0323fb4e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19109)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] lokeshj1703 commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


lokeshj1703 commented on PR #9345:
URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666574758

   I have reverted the incremental relation changes; the test was passing. My bad.





[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


lokeshj1703 commented on code in PR #9345:
URL: https://github.com/apache/hudi/pull/9345#discussion_r1285100568


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala:
##
@@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.DataSourceWriteOptions.{DELETE_OPERATION_OPT_VAL, 
PRECOMBINE_FIELD, RECORDKEY_FIELD}
+import org.apache.hudi.async.SparkAsyncCompactService
+import org.apache.hudi.client.SparkRDDWriteClient
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.client.utils.MetadataConversionUtils
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.{FileSlice, HoodieCommitMetadata, 
HoodieTableType, WriteOperationType}
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.common.table.timeline.HoodieInstant
+import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig, 
HoodieWriteConfig}
+import org.apache.hudi.functional.ColumnStatIndexTestBase.ColumnStatsTestCase
+import org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView
+import org.apache.hudi.util.{JFunction, JavaConversions}
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, 
HoodieFileIndex}
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression, GreaterThan, Literal}
+import org.apache.spark.sql.types.StringType
+import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
+import org.junit.jupiter.api.Disabled
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.MethodSource
+
+import java.util.Properties
+import scala.collection.JavaConverters
+
+class TestColumnStatsIndexWithSQL extends ColumnStatIndexTestBase {
+
+  @ParameterizedTest
+  @MethodSource(Array("testMetadataColumnStatsIndexParams"))
+  def testMetadataColumnStatsIndexWithSQL(testCase: ColumnStatsTestCase): Unit 
= {
+val metadataOpts = Map(
+  HoodieMetadataConfig.ENABLE.key -> "true",
+  HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true"
+)
+
+val commonOpts = Map(
+  "hoodie.insert.shuffle.parallelism" -> "4",
+  "hoodie.upsert.shuffle.parallelism" -> "4",
+  HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+  DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString,
+  RECORDKEY_FIELD.key -> "c1",
+  PRECOMBINE_FIELD.key -> "c1",
+  HoodieTableConfig.POPULATE_META_FIELDS.key -> "true",
+  DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true",
+  DataSourceReadOptions.QUERY_TYPE.key -> 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL
+) ++ metadataOpts
+setupTable(testCase, metadataOpts, commonOpts, shouldValidate = true)
+verifyFileIndexAndSQLQueries(commonOpts)
+  }
+
+  @ParameterizedTest
+  @MethodSource(Array("testMetadataColumnStatsIndexParamsForMOR"))
+  def testMetadataColumnStatsIndexSQLWithInMemoryIndex(testCase: 
ColumnStatsTestCase): Unit = {
+val metadataOpts = Map(
+  HoodieMetadataConfig.ENABLE.key -> "true",
+  HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true"
+)
+
+val commonOpts = Map(
+  "hoodie.insert.shuffle.parallelism" -> "4",
+  "hoodie.upsert.shuffle.parallelism" -> "4",
+  HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+  DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString,
+  RECORDKEY_FIELD.key -> "c1",
+  PRECOMBINE_FIELD.key -> "c1",
+  HoodieTableConfig.POPULATE_META_FIELDS.key -> "true",
+  DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true",
+  DataSourceReadOptions.QUERY_TYPE.key -> 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL,
+  HoodieIndexConfig.INDEX_TYPE.key() -> INMEMORY.name()
+) ++ metadataOpts
+
+doWriteAndValidateColumnStats(testCase, metadataOpts, commonOpts,
+  dataSourcePath = "index/colstats/input-table-json",
+  expectedColStatsSourcePath = 
"index/colstats/column-stats-ind

[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9373:
URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666562451

   
   ## CI report:
   
   * 320d28961b879b719401a5a21ce4c54e1d0b1073 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19113)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   








[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666550244

   
   ## CI report:
   
   * ee5d3013bf73645648b344a468404dd076e80003 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9373: [HUDI-6651] Support IN SQL query with Record Index

2023-08-05 Thread via GitHub


nsivabalan commented on code in PR #9373:
URL: https://github.com/apache/hudi/pull/9373#discussion_r1285082443


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -343,7 +343,7 @@ abstract class HoodieBaseRelation(val sqlContext: 
SQLContext,
*/
   override final def needConversion: Boolean = false
 
-  override def inputFiles: Array[String] = 
fileIndex.allFiles.map(_.getPath.toUri.toString).toArray
+  override def inputFiles: Array[String] = 
fileIndex.allBaseFiles.map(_.getPath.toUri.toString).toArray

Review Comment:
   Are we sure about this change? Also, can we add Javadocs explaining what the expected file list is here and how it is used?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndexWithSQL.scala:
##
@@ -45,12 +52,71 @@ class TestRecordLevelIndexWithSQL extends 
RecordLevelIndexTestBase {
   validate = false)
 
 createTempTable(hudiOpts)
-val reckey = 
mergedDfList.last.limit(1).collect()(0).getAs("_row_key").toString
-spark.sql("select * from " + sqlTempTable + " where '" + reckey + "' = 
_row_key").show(false)
+testInQuery(hudiOpts)
+testEqualToQuery(hudiOpts)
+  }
+
+  def testEqualToQuery(hudiOpts: Map[String, String]): Unit = {
+val reckey = mergedDfList.last.limit(1).collect().map(row => 
row.getAs("_row_key").toString)
+val dataFilter = EqualTo(attribute("_row_key"), Literal(reckey(0)))
+assertEquals(1, spark.sql("select * from " + sqlTempTable + " where " + 
dataFilter.sql).count())
+verifyPruningFileCount(hudiOpts, dataFilter, 1)
+  }
+
+  def testInQuery(hudiOpts: Map[String, String]): Unit = {
+var reckey = mergedDfList.last.limit(1).collect().map(row => 
row.getAs("_row_key").toString)
+var dataFilter = In(attribute("_row_key"), reckey.map(l => 
literal(l)).toList)
+assertEquals(1, spark.sql("select * from " + sqlTempTable + " where " + 
dataFilter.sql).count())
+var numFiles = if (isTableMOR()) 2 else 1
+verifyPruningFileCount(hudiOpts, dataFilter, numFiles)
+
+reckey = mergedDfList.last.limit(2).collect().map(row => 
row.getAs("_row_key").toString)
+dataFilter = In(attribute("_row_key"), reckey.map(l => literal(l)).toList)
+assertEquals(2, spark.sql("select * from " + sqlTempTable + " where " + 
dataFilter.sql).count())
+numFiles = if (isTableMOR()) 2 else 2
+verifyPruningFileCount(hudiOpts, dataFilter, numFiles)
+  }
+
+  private def attribute(partition: String): AttributeReference = {
+AttributeReference(partition, StringType, true)()
+  }
+
+  private def literal(value: String): Literal = {
+Literal.create(value)
+  }
+
+  private def verifyPruningFileCount(opts: Map[String, String], dataFilter: 
Expression, numFiles: Int): Unit = {
+metaClient = HoodieTableMetaClient.reload(metaClient)
+val fileIndex = HoodieFileIndex(spark, metaClient, None, opts + ("path" -> 
basePath))
+fileIndex.setIncludeLogFiles(isTableMOR())
+val filteredPartitionDirectories = fileIndex.listFiles(Seq(), 
Seq(dataFilter))
+val filteredFilesCount = filteredPartitionDirectories.flatMap(s => 
s.files).size
+assertTrue(filteredFilesCount < getLatestDataFilesCount(opts))
+assertEquals(filteredFilesCount, numFiles)
+  }
+
+  private def isTableMOR(): Boolean = {
+metaClient.getTableType == HoodieTableType.MERGE_ON_READ
+  }
+
+  private def getLatestDataFilesCount(opts: Map[String, String], 
includeLogFiles: Boolean = true) = {
+var totalLatestDataFiles = 0L
+
getTableFileSystenView(opts).getAllLatestFileSlicesBeforeOrOn(metaClient.getActiveTimeline.lastInstant().get().getTimestamp)
+  .values()
+  .forEach(JFunction.toJavaConsumer[java.util.stream.Stream[FileSlice]]
+(slices => slices.forEach(JFunction.toJavaConsumer[FileSlice](
+  slice => totalLatestDataFiles += (if (includeLogFiles) 
slice.getLogFiles.count() else 0)
++ (if (slice.getBaseFile.isPresent) 1 else 0)
+totalLatestDataFiles
+  }
+
+  private def getTableFileSystenView(opts: Map[String, String]): 
HoodieMetadataFileSystemView = {
+new HoodieMetadataFileSystemView(metaClient, metaClient.getActiveTimeline, 
metadataWriter(getWriteConfig(opts)).getTableMetadata)
   }
 
   private def createTempTable(hudiOpts: Map[String, String]): Unit = {
 val readDf = spark.read.format("hudi").options(hudiOpts).load(basePath)
+readDf.printSchema()

Review Comment:
   What's the necessity for this?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala:
##
@@ -123,28 +123,47 @@ class RecordLevelIndexSupport(spark: SparkSession,
* @param queryFilters The queries that need to be filtered.
* @return Tuple of List of filtered queries and list of record key literals 
that need to be matched
*/
-  private def filterQueryFiltersWithRecordKey(queryFilters: Seq[Expression]): 
(L

[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666540053

   
   ## CI report:
   
   * ee5d3013bf73645648b344a468404dd076e80003 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9373:
URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666540045

   
   ## CI report:
   
   * 320d28961b879b719401a5a21ce4c54e1d0b1073 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19113)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9287:
URL: https://github.com/apache/hudi/pull/9287#issuecomment-1666540011

   
   ## CI report:
   
   * cc2d8c36f8545608607cd3c4bcd481369d668d76 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19107)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR

2023-08-05 Thread via GitHub


codope commented on code in PR #9345:
URL: https://github.com/apache/hudi/pull/9345#discussion_r1285078625


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/ColumnStatIndexTestBase.scala:
##
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hadoop.fs.{LocatedFileStatus, Path}
+import org.apache.hudi.ColumnStatsIndexSupport.composeIndexSchema
+import org.apache.hudi.HoodieConversionUtils.toProperties
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
HoodieStorageConfig}
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.functional.ColumnStatIndexTestBase.ColumnStatsTestCase
+import org.apache.hudi.testutils.HoodieSparkClientTestBase
+import org.apache.hudi.{ColumnStatsIndexSupport, DataSourceWriteOptions}
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions.typedLit
+import org.apache.spark.sql.types._
+import org.junit.jupiter.api.Assertions.assertEquals
+import org.junit.jupiter.api._
+import org.junit.jupiter.params.provider.Arguments
+
+import java.math.BigInteger
+import java.sql.{Date, Timestamp}
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+@Tag("functional")
+class ColumnStatIndexTestBase extends HoodieSparkClientTestBase {
+  var spark: SparkSession = _
+  var dfList: Seq[DataFrame] = Seq()
+
+  val sourceTableSchema =
+new StructType()
+  .add("c1", IntegerType)
+  .add("c2", StringType)
+  .add("c3", DecimalType(9, 3))
+  .add("c4", TimestampType)
+  .add("c5", ShortType)
+  .add("c6", DateType)
+  .add("c7", BinaryType)
+  .add("c8", ByteType)
+
+  @BeforeEach
+  override def setUp() {
+initPath()
+initSparkContexts()
+initFileSystem()
+
+setTableName("hoodie_test")
+initMetaClient()
+
+spark = sqlContext.sparkSession
+  }
+
+  @AfterEach
+  override def tearDown() = {
+cleanupFileSystem()
+cleanupSparkContexts()
+  }
+
+  protected def doWriteAndValidateColumnStats(testCase: ColumnStatsTestCase,
+metadataOpts: Map[String, String],
+hudiOpts: Map[String, String],
+dataSourcePath: String,
+expectedColStatsSourcePath: String,
+operation: String,
+saveMode: SaveMode,
+shouldValidate: Boolean = true): 
Unit = {
+val sourceJSONTablePath = 
getClass.getClassLoader.getResource(dataSourcePath).toString
+
+// NOTE: Schema here is provided for validation that the input date is in 
the appropriate format
+val inputDF = 
spark.read.schema(sourceTableSchema).json(sourceJSONTablePath)
+
+inputDF
+  .sort("c1")
+  .repartition(4, new Column("c1"))
+  .write
+  .format("hudi")
+  .options(hudiOpts)
+  .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, 10 * 1024)
+  .option(DataSourceWriteOptions.OPERATION.key, operation)
+  .mode(saveMode)
+  .save(basePath)
+dfList = dfList :+ inputDF
+
+metaClient = HoodieTableMetaClient.reload(metaClient)
+
+if (shouldValidate) {
+  // Currently, routine manually validating the column stats (by actually 
reading every column of every file)
+  // only supports parquet files. Therefore we skip such validation when 
delta-log files are present, and only
+  // validate in following cases: (1) COW: all operations; (2) MOR: insert 
only.
+  val shouldValidateColumnStatsManually = testCase.tableType == 
HoodieTableType.COPY_ON_WRITE ||
+operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+
+  validateColumnStatsIndex(
+testCase, metadataOpts, expectedColStatsSourcePath, 
shouldValidateColumnStatsManually)
+}
+  }
+
+  protected def buildColumnStatsTableManually(tablePath: String,
+includedCols: Seq[String],
+

[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index

2023-08-05 Thread via GitHub


hudi-bot commented on PR #9373:
URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666538376

   
   ## CI report:
   
   * 320d28961b879b719401a5a21ce4c54e1d0b1073 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   




