[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning
hudi-bot commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666742342

## CI report:

* 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19122)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[hudi] branch master updated: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync (#9374)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new cecd79e947b  [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync (#9374)
cecd79e947b is described below

commit cecd79e947b3a274a26e72a1f9c730323393f8f9
Author: Sagar Sumit
AuthorDate: Sun Aug 6 11:34:46 2023 +0530

    [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync (#9374)
---
 .../hudi/sync/common/util/SyncUtilHelpers.java     | 65 +++---
 .../hudi/sync/common/util/TestSyncUtilHelpers.java | 64 +
 2 files changed, 108 insertions(+), 21 deletions(-)

diff --git a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
index 466c0c71ffa..440baa86c0c 100644
--- a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
+++ b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
@@ -23,7 +23,6 @@
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieMetaSyncException;
-import org.apache.hudi.sync.common.HoodieSyncConfig;
 import org.apache.hudi.sync.common.HoodieSyncTool;
 
 import org.apache.hadoop.conf.Configuration;
@@ -33,36 +32,56 @@
 import org.slf4j.LoggerFactory;
 
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
 
+  // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage.
+  // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage.
-   * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
-   * @param props             property map.
-   * @param hadoopConfig      Hadoop confs.
-   * @param fs                Filesystem used.
-   * @param targetBasePath    The target base path that contains the hoodie table.
-   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new Reentrant
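The commit above replaces a single class-wide `synchronized` method with one lock per table base path, so syncs against different tables can run in parallel while syncs against the same table still serialize. The following is a minimal standalone sketch of that pattern only; the class, method, and path names here are invented for illustration and are not Hudi's:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PerPathLockDemo {

  // One lock per base path; computeIfAbsent guarantees exactly one lock instance per key,
  // so callers working on the same table serialize while different tables proceed in parallel.
  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();

  private static int counter = 0; // shared state the lock protects in this demo

  static void runWithTableLock(String basePath, Runnable task) {
    if (basePath == null) {
      throw new IllegalArgumentException("Target base path must not be null");
    }
    Lock tableLock = TABLE_LOCKS.computeIfAbsent(basePath, k -> new ReentrantLock());
    tableLock.lock();
    try {
      task.run();
    } finally {
      tableLock.unlock(); // always release, even if the task throws
    }
  }

  // Run nThreads unsynchronized increments under the per-path lock and return the total.
  static int runConcurrently(int nThreads) throws InterruptedException {
    counter = 0;
    Thread[] threads = new Thread[nThreads];
    for (int i = 0; i < nThreads; i++) {
      threads[i] = new Thread(() -> runWithTableLock("s3://bucket/table", () -> counter++));
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    return counter;
  }

  public static void main(String[] args) throws InterruptedException {
    // The lock serializes the increments, so no updates are lost.
    System.out.println(runConcurrently(8));
  }
}
```

The key detail is `computeIfAbsent`: it is atomic on `ConcurrentHashMap`, so two threads racing on the same base path always receive the same `ReentrantLock` instance rather than each creating their own.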
[GitHub] [hudi] codope merged pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
codope merged PR #9374:
URL: https://github.com/apache/hudi/pull/9374
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285153000


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
+import org.apache.hudi.avro.model.IntWrapper;
+import org.apache.hudi.avro.model.StringWrapper;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.HoodieCleanStat;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
+import org.apache.hudi.common.model.HoodieDeltaWriteStat;
+import org.apache.hudi.common.model.HoodieFileGroup;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ExternalFilePathUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieBackedTableMetadata;
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+import org.apache.hudi.testutils.HoodieClientTestBase;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY;
+
+/**
+ * Asserts that tables initialized from file paths created outside Hudi can properly be loaded.
+ */
+public class TestExternalPathHandling extends HoodieClientTestBase {
+
+  private static final String FIELD_1 = "field1";
+  private static final String FIELD_2 = "field2";
+  private HoodieWriteConfig writeConfig;
+
+  @ParameterizedTest
+  @MethodSource("getArgs")
+  public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List partitions) throws Exception {
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    writeConfig = HoodieWriteConfig.newBuilder()
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build())
+        .withPath(metaClient.getBasePathV2().toString())
+        .withEmbeddedTimelineServerEnabled(false)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+            .withMaxNumDeltaCommitsBeforeCompaction(2)
+            .withFileSystemBootstrapDisabled(true)
+            .enable(true)
+            .withMetadataIndexColumnStats(true)
+            .withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2)
+            .build())
+        .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build())
+        .withTableServicesEnabled(true)
+        .build();
+
+    writeClient = getHoodieWriteClient(writeConfig);
+    String instantTime1 = writeClient.startCommit(HoodieTimeline.R
[GitHub] [hudi] hudi-bot commented on pull request #9376: [HUDI-6654] Add new log block header type to store record positions
hudi-bot commented on PR #9376:
URL: https://github.com/apache/hudi/pull/9376#issuecomment-1666718268

## CI report:

* 156b61cd5e76559df94ddeada2d11d43c29e750c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19125)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9221: [HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig
danny0405 commented on code in PR #9221:
URL: https://github.com/apache/hudi/pull/9221#discussion_r1285152739


##########
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java:
##########
@@ -98,8 +98,9 @@ public HiveSyncConfig(Properties props) {
 
   public HiveSyncConfig(Properties props, Configuration hadoopConf) {
     super(props, hadoopConf);
-    HiveConf hiveConf = hadoopConf instanceof HiveConf
-        ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);
+    HiveConf hiveConf = new HiveConf();
+    // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
+    hiveConf.addResource(hadoopConf);

Review Comment:
   > but it's possible that there are other configs/custom configs passed in via Spark session,

   Is this a classical way people pass around hive options with spark?

   > An alternative solution would be always pass hadoopConf to HiveConf constructor

   Does it introduce too much overhead then?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285152730


##########
hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix acts as a marker when appended to a file path that the path was created by an external system and not a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   gotcha
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285152678


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
           + "thereby reducing the number of blocks to search for keys");
 
+  public static final ConfigProperty DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+      .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+      .defaultValue(false)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   hoodie.metadata.auto.initialize sounds better
[hudi] branch master updated (2b34d3fddbd -> 37dd13642f5)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 2b34d3fddbd  [HUDI-6646] Add default lock provider for spark offline compaction and clustering service (#9365)
     add 37dd13642f5  [HUDI-6386] Fix flakey multiwriter tests (#9369)

No new revisions were added by this update.

Summary of changes:
 .../hudi/client/TestHoodieClientMultiWriter.java | 43 --
 1 file changed, 39 insertions(+), 4 deletions(-)
[GitHub] [hudi] nsivabalan merged pull request #9369: [HUDI-6386] Fix flakey multiwriter tests
nsivabalan merged PR #9369:
URL: https://github.com/apache/hudi/pull/9369
[GitHub] [hudi] danny0405 commented on pull request #9255: [HUDI-6503] Make TableServiceClient's txnManager consistent with Writ…
danny0405 commented on PR #9255:
URL: https://github.com/apache/hudi/pull/9255#issuecomment-1666717120

Let's move this out of 0.14.0 because it's only a code refactoring.
[GitHub] [hudi] hudi-bot commented on pull request #9376: [HUDI-6654] Add new log block header type to store record positions
hudi-bot commented on PR #9376:
URL: https://github.com/apache/hudi/pull/9376#issuecomment-1666716359

## CI report:

* 156b61cd5e76559df94ddeada2d11d43c29e750c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default
hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666714706

## CI report:

* 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
* 918c99b3221af6f081abd37c9d9494767aab0906 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19123)
[GitHub] [hudi] vinothchandar commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
vinothchandar commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285149102


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -440,6 +447,10 @@ public int getRecordIndexMaxParallelism() {
     return getInt(RECORD_INDEX_MAX_PARALLELISM);
   }
 
+  public boolean isFileSystemBootstrapDisabled() {

Review Comment:
   to be renamed along with config name change.

##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -616,6 +616,9 @@ private HoodieTableMetaClient initializeMetaClient() throws IOException {
    * @return List consisting of {@code DirectoryInfo} for each partition found.
    */
   private List listAllPartitionsFromFilesystem(String initializationTime) {
+    if (dataWriteConfig.getMetadataConfig().isFileSystemBootstrapDisabled()) {

Review Comment:
   food for thought : I think this should prevent any MDT partitions from being initialized, not just `files`

##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
           + "thereby reducing the number of blocks to search for keys");
 
+  public static final ConfigProperty DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+      .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+      .defaultValue(false)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   We can mark this advanced. (@yihua to confirm). On naming: just `hoodie.metadata.bootstrap.disabled` or `hoodie.metadata.auto.initialize` = true (default) | false? Thats the best I can think of.
[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6654:
---------------------------------
    Labels: pull-request-available  (was: )

> Add new log block header type to store record positions
> -------------------------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> To support position-based merging of base and log files, we need to encode
> positions in the log blocks so that the positions can be used directly,
> without having to deserialize records or delete keys for OverwriteWithLatest
> payload, or with ordering values required only for
> `DefaultHoodieRecordPayload` supporting event time based streaming. We add a
> new `HeaderMetadataType` to store the positions in the log block header.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] yihua opened a new pull request, #9376: [HUDI-6654] Add new log block header type to store record positions
yihua opened a new pull request, #9376:
URL: https://github.com/apache/hudi/pull/9376

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Summary: Add new log block header type to store record positions  (was: Add new log blocker header type to store record positions)
[jira] [Updated] (HUDI-6654) Add new log blocker header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Summary: Add new log blocker header type to store record positions  (was: Encode record positions in log blocks)
[jira] [Updated] (HUDI-6654) Encode record positions in log blocks
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Description:
To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming. We add a new `HeaderMetadataType` to store the positions in the log block header.

    (was: To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming.)
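The HUDI-6654 description is dense; its core idea is that if each log block records the ordinal position of the base-file record it updates or deletes, a reader can merge by array index instead of joining on record keys or deserializing payloads. A toy illustration of that merge step, with data structures invented for the example (not Hudi's actual reader):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class PositionMergeDemo {

  // Merge by position: overwrite base records at the given ordinals.
  // No per-record key extraction or lookup is needed, only indexing.
  static List<String> mergeByPosition(List<String> baseRecords, Map<Integer, String> updatesByPosition) {
    String[] merged = baseRecords.toArray(new String[0]);
    updatesByPosition.forEach((pos, value) -> merged[pos] = value);
    return Arrays.asList(merged);
  }

  public static void main(String[] args) {
    List<String> base = Arrays.asList("a0", "b0", "c0");
    // The "log block" here is just positions 1 -> new value; prints [a0, b1, c0].
    System.out.println(mergeByPosition(base, Map.of(1, "b1")));
  }
}
```

The point of storing positions in a log block header, per the description, is exactly to make this index-based path possible without touching record payloads during the merge.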
[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default
hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666702367

## CI report:

* 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
* 918c99b3221af6f081abd37c9d9494767aab0906 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9375: [MINOR] Disable the MDT for flink by default
hudi-bot commented on PR #9375:
URL: https://github.com/apache/hudi/pull/9375#issuecomment-1666700777

## CI report:

* 99d76ee8e806c5cbf6cffc511e0083a747b3e94c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
hudi-bot commented on PR #9374:
URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666700762

## CI report:

* ee5d3013bf73645648b344a468404dd076e80003 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning
hudi-bot commented on PR #9317:
URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666700714

## CI report:

* 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19122)
[GitHub] [hudi] cxzl25 commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning
cxzl25 commented on PR #9317: URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666700500 @hudi-bot run azure
[jira] [Updated] (HUDI-6646) Add default lock provider for spark offline compaction and clustering service
[ https://issues.apache.org/jira/browse/HUDI-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6646: - Fix Version/s: 0.14.0 (was: 1.0.0) > Add default lock provider for spark offline compaction and clustering service > - > > Key: HUDI-6646 > URL: https://issues.apache.org/jira/browse/HUDI-6646 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6646) Add default lock provider for spark offline compaction and clustering service
[ https://issues.apache.org/jira/browse/HUDI-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6646. Resolution: Fixed Fixed via master branch: 2b34d3fddbd757a9ff77dd0ad2c67f0d04b03e4f
[hudi] branch master updated: [HUDI-6646] Add default lock provider for spark offline compaction and clustering service (#9365)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 2b34d3fddbd [HUDI-6646] Add default lock provider for spark offline compaction and clustering service (#9365) 2b34d3fddbd is described below commit 2b34d3fddbd757a9ff77dd0ad2c67f0d04b03e4f Author: Danny Chan AuthorDate: Sun Aug 6 11:43:03 2023 +0800 [HUDI-6646] Add default lock provider for spark offline compaction and clustering service (#9365) The default in-process lock provider does not work well for multi-engine interactions. Configure a file system lock provider which shares the same lock path across engines. --- .../lock/FileSystemBasedLockProvider.java | 25 .../org/apache/hudi/util/FlinkWriteClients.java| 6 +-- .../java/org/apache/hudi/util/StreamerUtil.java| 8 .../scala/org/apache/hudi/HoodieCLIUtils.scala | 9 - .../procedures/RunClusteringProcedure.scala| 9 - .../procedures/RunCompactionProcedure.scala| 8 .../apache/hudi/utilities/HoodieClusteringJob.java | 4 ++ .../org/apache/hudi/utilities/HoodieCompactor.java | 4 ++ .../org/apache/hudi/utilities/UtilHelpers.java | 8 .../org/apache/hudi/utilities/TestUtilHelpers.java | 44 ++ 10 files changed, 110 insertions(+), 15 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java index dc0f16ecb40..da7e71a2058 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java @@ -21,6 +21,7 @@ package org.apache.hudi.client.transaction.lock; import
org.apache.hudi.common.config.HoodieCommonConfig; import org.apache.hudi.common.config.LockConfiguration; +import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.fs.FSUtils; import org.apache.hudi.common.fs.StorageSchemes; import org.apache.hudi.common.lock.LockProvider; @@ -29,6 +30,7 @@ import org.apache.hudi.common.table.HoodieTableMetaClient; import org.apache.hudi.common.util.FileIOUtils; import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.config.HoodieLockConfig; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.exception.HoodieIOException; import org.apache.hudi.exception.HoodieLockException; @@ -50,6 +52,7 @@ import java.util.concurrent.TimeUnit; import static org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_EXPIRE_PROP_KEY; import static org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY; +import static org.apache.hudi.common.table.HoodieTableMetaClient.AUXILIARYFOLDER_NAME; /** * A FileSystem based lock. This {@link LockProvider} implementation allows to lock table operations @@ -201,4 +204,26 @@ public class FileSystemBasedLockProvider implements LockProvider, Serial || config.getConfig().getString(HoodieWriteConfig.BASE_PATH.key(), null) != null); ValidationUtils.checkArgument(config.getConfig().getInteger(FILESYSTEM_LOCK_EXPIRE_PROP_KEY) >= 0); } + + /** + * Returns a filesystem based lock config with given table path. 
+ */ + public static TypedProperties getLockConfig(String tablePath) { +TypedProperties props = new TypedProperties(); +props.put(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key(), FileSystemBasedLockProvider.class.getName()); +props.put(HoodieLockConfig.LOCK_ACQUIRE_WAIT_TIMEOUT_MS.key(), "2000"); +props.put(HoodieLockConfig.FILESYSTEM_LOCK_EXPIRE.key(), "1"); +props.put(HoodieLockConfig.LOCK_ACQUIRE_CLIENT_NUM_RETRIES.key(), "30"); +props.put(HoodieLockConfig.FILESYSTEM_LOCK_PATH.key(), defaultLockPath(tablePath)); +return props; + } + + /** + * Returns the default lock file root path. + * + * IMPORTANT: this path should be shared especially when there is engine cooperation. + */ + private static String defaultLockPath(String tablePath) { +return tablePath + Path.SEPARATOR + AUXILIARYFOLDER_NAME; + } } diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java index 3e84e8493e3..6d8b0d0a7d6 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/FlinkWriteClients.java @@ -227,12 +227,8 @@
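The `getLockConfig` helper in the commit above boils down to pointing every engine at one shared lock directory under the table's auxiliary folder. Below is a standalone sketch of that pattern; the literal property keys and the `.hoodie/.aux` folder name are illustrative stand-ins for the real `HoodieLockConfig` constants and `HoodieTableMetaClient.AUXILIARYFOLDER_NAME`, not guaranteed Hudi API:

```java
import java.util.Properties;

// Sketch of the shared filesystem-lock configuration from the commit above.
// Key names and folder layout are assumptions for illustration only.
public class LockConfigSketch {

    // Assumed location of the auxiliary folder under the table base path.
    static final String AUX_FOLDER_NAME = ".hoodie/.aux";

    // Every engine writing to the same table path derives the same lock path,
    // so compaction/clustering jobs across engines contend on the same lock.
    static String defaultLockPath(String tablePath) {
        return tablePath + "/" + AUX_FOLDER_NAME;
    }

    static Properties getLockConfig(String tablePath) {
        Properties props = new Properties();
        props.put("hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider");
        props.put("hoodie.write.lock.wait_time_ms", "2000");      // acquire wait timeout
        props.put("hoodie.write.lock.filesystem.expire", "1");    // lock expiry
        props.put("hoodie.write.lock.client.num_retries", "30");  // acquire retries
        props.put("hoodie.write.lock.filesystem.path", defaultLockPath(tablePath));
        return props;
    }
}
```

The important design point, called out in the commit comment, is that the lock path must be derived purely from the table path so that independently configured engines land on the same lock files.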
[GitHub] [hudi] danny0405 merged pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…
danny0405 merged PR #9365: URL: https://github.com/apache/hudi/pull/9365
[GitHub] [hudi] danny0405 commented on pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…
danny0405 commented on PR #9365: URL: https://github.com/apache/hudi/pull/9365#issuecomment-199059 Tests have passed: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19109&view=results
[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
hudi-bot commented on PR #9374: URL: https://github.com/apache/hudi/pull/9374#issuecomment-198867 ## CI report: * ee5d3013bf73645648b344a468404dd076e80003 UNKNOWN
[jira] [Created] (HUDI-6655) Fix TestWriteMergeOnRead#testConsistentBucketIndex
Danny Chen created HUDI-6655: Summary: Fix TestWriteMergeOnRead#testConsistentBucketIndex Key: HUDI-6655 URL: https://issues.apache.org/jira/browse/HUDI-6655 Project: Apache Hudi Issue Type: Bug Reporter: Danny Chen
[hudi] branch master updated: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table (#9261)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 7061652e1bd [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table (#9261) 7061652e1bd is described below commit 7061652e1bda92037ae796edae3d18fb6ec64529 Author: Sivabalan Narayanan AuthorDate: Sat Aug 5 23:22:46 2023 -0400 [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table (#9261) Adding support for upsert and deletes with spark datasource for pk less table. --- .../hudi/metadata/HoodieMetadataWriteUtils.java| 2 +- .../apache/hudi/AutoRecordKeyGenerationUtils.scala | 6 +- .../main/scala/org/apache/hudi/DefaultSource.scala | 12 +-- .../org/apache/hudi/HoodieCreateRecordUtils.scala | 113 - .../org/apache/hudi/HoodieSparkSqlWriter.scala | 57 ++- .../scala/org/apache/hudi/HoodieWriterUtils.scala | 34 ++- .../TestAutoGenerationOfRecordKeys.scala | 82 ++- 7 files changed, 175 insertions(+), 131 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java index ad87b5287ca..2078896987d 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java @@ -114,7 +114,7 @@ public class HoodieMetadataWriteUtils { // we will trigger archive manually, to ensure only regular writer invokes it .withArchivalConfig(HoodieArchivalConfig.newBuilder() .archiveCommitsWith( -writeConfig.getMinCommitsToKeep(), writeConfig.getMaxCommitsToKeep()) +writeConfig.getMinCommitsToKeep() + 1, writeConfig.getMaxCommitsToKeep() + 1) .withAutoArchive(false) .build()) // we 
will trigger compaction manually, to control the instant times diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala index ca679acc799..501c563a989 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala @@ -32,7 +32,7 @@ object AutoRecordKeyGenerationUtils { private val log = LoggerFactory.getLogger(getClass) def mayBeValidateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, String], hoodieConfig: HoodieConfig): Unit = { -val autoGenerateRecordKeys = !parameters.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) // if record key is not configured, +val autoGenerateRecordKeys = isAutoGenerateRecordKeys(parameters) // hudi will auto generate. 
if (autoGenerateRecordKeys) { // de-dup is not supported with auto generation of record keys @@ -54,4 +54,8 @@ object AutoRecordKeyGenerationUtils { log.warn("Precombine field " + hoodieConfig.getString(PRECOMBINE_FIELD.key()) + " will be ignored with auto record key generation enabled") } } + + def isAutoGenerateRecordKeys(parameters: Map[String, String]): Boolean = { +!parameters.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) // if record key is not configured, + } } diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala index 0f159885d89..3e5cf351ba1 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala @@ -19,7 +19,7 @@ package org.apache.hudi import org.apache.hadoop.fs.Path import org.apache.hudi.DataSourceReadOptions._ -import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, SPARK_SQL_WRITES_PREPPED_KEY, OPERATION, STREAMING_CHECKPOINT_IDENTIFIER} +import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, RECORDKEY_FIELD, SPARK_SQL_WRITES_PREPPED_KEY, STREAMING_CHECKPOINT_IDENTIFIER} import org.apache.hudi.cdc.CDCRelation import org.apache.hudi.common.fs.FSUtils import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_READ} @@ -29,7 +29,7 @@ import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} import org.apache.hudi.common.util.ConfigUtils import org.apache.hudi.
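The Scala hunk above extracts the auto-key check into `isAutoGenerateRecordKeys`: record keys are auto-generated exactly when the writer options omit the record-key field. A minimal standalone sketch of that check (the literal key name stands in for `KeyGeneratorOptions.RECORDKEY_FIELD_NAME` and is an assumption, not a verified constant):

```java
import java.util.Map;

// Sketch of the auto record-key detection from the diff above: if the user did
// not configure a record-key field, Hudi auto-generates keys (pk-less table).
public class AutoKeySketch {
    // Assumed value of KeyGeneratorOptions.RECORDKEY_FIELD_NAME, for illustration.
    static final String RECORDKEY_FIELD = "hoodie.datasource.write.recordkey.field";

    static boolean isAutoGenerateRecordKeys(Map<String, String> parameters) {
        return !parameters.containsKey(RECORDKEY_FIELD);
    }
}
```

Centralizing this predicate is what lets both the writer path and the validation path (which rejects de-dup and warns about an ignored precombine field) agree on when auto-generation is in effect.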
[GitHub] [hudi] nsivabalan merged pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
nsivabalan merged PR #9261: URL: https://github.com/apache/hudi/pull/9261
[GitHub] [hudi] eric9204 closed pull request #9340: [HUDI-6630] Automatic release connection for hoodie metaserver client
eric9204 closed pull request #9340: [HUDI-6630] Automatic release connection for hoodie metaserver client URL: https://github.com/apache/hudi/pull/9340
[GitHub] [hudi] danny0405 opened a new pull request, #9375: [MINOR] Disable the MDT for flink by default
danny0405 opened a new pull request, #9375: URL: https://github.com/apache/hudi/pull/9375 ### Change Logs Disable the MDT for Flink by default because some cross-engine issues are still unresolved. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] eric9204 commented on pull request #9340: [HUDI-6630] Automatic release connection for hoodie metaserver client
eric9204 commented on PR #9340: URL: https://github.com/apache/hudi/pull/9340#issuecomment-191795 @hudi-bot run azure
[GitHub] [hudi] danny0405 closed pull request #7875: [MINOR] No need to close the metadata writer twice
danny0405 closed pull request #7875: [MINOR] No need to close the metadata writer twice URL: https://github.com/apache/hudi/pull/7875
[GitHub] [hudi] hudi-bot commented on pull request #7875: [MINOR] No need to close the metadata writer twice
hudi-bot commented on PR #7875: URL: https://github.com/apache/hudi/pull/7875#issuecomment-188590 ## CI report: * f47bd395c4f4569fa19cb99cb4882078ec683aca Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14989) * 4f79281e0bf502a648f48e1fee142cff03ccfe71 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
danny0405 commented on code in PR #9374: URL: https://github.com/apache/hudi/pull/9374#discussion_r1285145028 ## hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java: ## @@ -33,36 +32,56 @@ import java.util.Map; import java.util.Properties; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT; +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH; +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME; /** * Helper class for syncing Hudi commit data with external metastores. */ public class SyncUtilHelpers { private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class); + + // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage. + // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table. + private static final ConcurrentHashMap TABLE_LOCKS = new ConcurrentHashMap<>(); + /** * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems. * - * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage. - * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table. - * - * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation. - * @param props property map. - * @param hadoopConfigHadoop confs. - * @param fs Filesystem used. - * @param targetBasePath The target base path that contains the hoodie table. - * @param baseFileFormat The file format used by the hoodie table (defaults to PARQUET). 
+ * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation. + * @param props property map. + * @param hadoopConfig Hadoop confs. + * @param fsFilesystem used. + * @param targetBasePathThe target base path that contains the hoodie table. + * @param baseFileFormatThe file format used by the hoodie table (defaults to PARQUET). */ - public static synchronized void runHoodieMetaSync(String syncToolClassName, + public static void runHoodieMetaSync(String syncToolClassName, TypedProperties props, Configuration hadoopConfig, FileSystem fs, String targetBasePath, String baseFileFormat) { -try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) { - syncTool.syncHoodieTable(); -} catch (Throwable e) { - throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e); +if (targetBasePath == null) { + throw new IllegalArgumentException("Target base path must not be null"); +} + +// Get or create a lock for the specific table +Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock()); Review Comment: yeah, we can always extend it with more sophisticated requests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
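The review thread above centers on the `TABLE_LOCKS` map: one lazily created `ReentrantLock` per base path, so meta syncs against the same table serialize (avoiding Hive's `ConcurrentModificationException` on `alter_table`) while syncs for different tables proceed in parallel. Below is a self-contained sketch of that pattern; `runWithTableLock` and the counter-based demo are hypothetical stand-ins for `runHoodieMetaSync`, not Hudi's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PerPathLockSketch {
    // One lock per base path; computeIfAbsent creates it atomically on first use.
    private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();

    static int counter = 0; // shared mutable state standing in for the metastore

    static void runWithTableLock(String basePath, Runnable syncAction) {
        if (basePath == null) {
            throw new IllegalArgumentException("Target base path must not be null");
        }
        Lock tableLock = TABLE_LOCKS.computeIfAbsent(basePath, k -> new ReentrantLock());
        tableLock.lock();
        try {
            syncAction.run(); // e.g. syncTool.syncHoodieTable()
        } finally {
            tableLock.unlock();
        }
    }

    // Hammer one base path from several threads; with the per-path lock,
    // no increments are lost, so the result is exactly threads * iters.
    static int hammer(String basePath, int threads, int iters) {
        counter = 0;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iters; j++) {
                    runWithTableLock(basePath, () -> counter++);
                }
            });
        }
        try {
            for (Thread t : workers) t.start();
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(hammer("s3://bucket/table", 4, 1000)); // prints 4000
    }
}
```

One observable tradeoff of this sketch (and, as far as the diff shows, of the PR's approach): entries are never removed from the map, which is fine when a process only ever syncs a bounded set of base paths.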
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9280: [HUDI-6587] Check incomplete commit for time travel query
nsivabalan commented on code in PR #9280: URL: https://github.com/apache/hudi/pull/9280#discussion_r1285144748 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java: ## @@ -315,6 +318,29 @@ public static Option getEarliestInstantForMetadataArchival( } } + /** + * Validate user-specified timestamp of time travel query against incomplete commit's timestamp. + * + * @throws HoodieException when time travel query's timestamp >= incomplete commit's timestamp + */ + public static void validateTimestampAsOf(HoodieTableMetaClient metaClient, String timestampAsOf) { Review Comment: are we not supporting the 3 enum values? looks like we are going w/ FAIL directly ?
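Per the javadoc in the hunk above, the guard rejects a time-travel timestamp that is not strictly earlier than an incomplete commit's timestamp. Since Hudi instant times are fixed-width digit strings, lexicographic comparison matches chronological order. A minimal sketch of that check, with an assumed `Optional`-based signature instead of Hudi's `HoodieTableMetaClient`:

```java
import java.util.Optional;

// Sketch of the time-travel validation described above: fail when the query
// timestamp is >= the earliest incomplete commit's timestamp. The method
// signature and exception type are illustrative, not Hudi's actual API.
public class TimeTravelGuardSketch {
    static void validateTimestampAsOf(String timestampAsOf,
                                      Optional<String> earliestIncompleteInstant) {
        earliestIncompleteInstant.ifPresent(incomplete -> {
            // Hudi instant times (e.g. yyyyMMddHHmmss...) compare correctly as strings.
            if (timestampAsOf.compareTo(incomplete) >= 0) {
                throw new IllegalStateException(
                    "Time travel timestamp " + timestampAsOf
                        + " is not earlier than incomplete commit " + incomplete);
            }
        });
    }
}
```

This fail-fast behavior is what the review comment questions: the guard always throws rather than supporting softer policies (e.g. warn or ignore).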
[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files
hudi-bot commented on PR #9367: URL: https://github.com/apache/hudi/pull/9367#issuecomment-185745 ## CI report: * cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19118)
[GitHub] [hudi] nsivabalan commented on pull request #9071: [HUDI-6453] Cascade Glue schema changes to partitions
nsivabalan commented on PR #9071: URL: https://github.com/apache/hudi/pull/9071#issuecomment-182663 hey @CTTY @parisni : We are looking to cut the branch for 0.14.0 in 1 day. If you can coordinate and land the patch, that would be good and we can pull it in. CC @prashantwason
[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
hudi-bot commented on PR #9371: URL: https://github.com/apache/hudi/pull/9371#issuecomment-175249 ## CI report: * 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN * 5bc292aa241347e9a985e70e1335a9d7309d51aa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19117)
[GitHub] [hudi] codope commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
codope commented on code in PR #9374: URL: https://github.com/apache/hudi/pull/9374#discussion_r1285138372 ## hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java: ## @@ -33,36 +32,56 @@ import java.util.Map; import java.util.Properties; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT; +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH; +import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME; /** * Helper class for syncing Hudi commit data with external metastores. */ public class SyncUtilHelpers { private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class); + + // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage. + // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table. + private static final ConcurrentHashMap TABLE_LOCKS = new ConcurrentHashMap<>(); + /** * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems. * - * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage. - * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table. - * - * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation. - * @param props property map. - * @param hadoopConfigHadoop confs. - * @param fs Filesystem used. - * @param targetBasePath The target base path that contains the hoodie table. - * @param baseFileFormat The file format used by the hoodie table (defaults to PARQUET). 
+ * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation. + * @param props property map. + * @param hadoopConfig Hadoop confs. + * @param fsFilesystem used. + * @param targetBasePathThe target base path that contains the hoodie table. + * @param baseFileFormatThe file format used by the hoodie table (defaults to PARQUET). */ - public static synchronized void runHoodieMetaSync(String syncToolClassName, + public static void runHoodieMetaSync(String syncToolClassName, TypedProperties props, Configuration hadoopConfig, FileSystem fs, String targetBasePath, String baseFileFormat) { -try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) { - syncTool.syncHoodieTable(); -} catch (Throwable e) { - throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e); +if (targetBasePath == null) { + throw new IllegalArgumentException("Target base path must not be null"); +} + +// Get or create a lock for the specific table +Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock()); Review Comment: I noticed that for the same table different sync tools run one after the other, so did not see the need to concat tool class to the key. In future, if we plan to run everything concurrently, we can add tool class the the key. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6592) Flink insert overwrite should support dynamic partition instead of whole table
[ https://issues.apache.org/jira/browse/HUDI-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6592: - Fix Version/s: 0.14.0 > Flink insert overwrite should support dynamic partition instead of whole table > -- > > Key: HUDI-6592 > URL: https://issues.apache.org/jira/browse/HUDI-6592 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.14.0 >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Flink insert overwrite should support dynamic partition instead of the whole > table, which behavior is consistent with the semantics of insert overwrite in > Flink.
[jira] [Closed] (HUDI-6592) Flink insert overwrite should support dynamic partition instead of whole table
[ https://issues.apache.org/jira/browse/HUDI-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6592. Resolution: Fixed Fixed via master branch: 8c6e8d92cad465925d5bf165deccc18229efcd1f
[hudi] branch master updated: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table (#9287)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 8c6e8d92cad  [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table (#9287)
8c6e8d92cad is described below

commit 8c6e8d92cad465925d5bf165deccc18229efcd1f
Author: Nicholas Jiang
AuthorDate: Sun Aug 6 08:15:19 2023 +0800

    [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table (#9287)
---
 .../apache/hudi/configuration/FlinkOptions.java    | 11 +++
 .../apache/hudi/configuration/OptionsResolver.java |  9 ++
 .../sink/overwrite/PartitionOverwriteMode.java     | 35 ++
 .../org/apache/hudi/table/HoodieTableSink.java     | 11 ---
 .../apache/hudi/table/ITTestHoodieDataSource.java  | 27 +
 .../test/java/org/apache/hudi/utils/TestData.java  | 16 ++
 6 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index c140d40af88..556d0b2ef2b 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -39,6 +39,7 @@ import org.apache.hudi.hive.ddl.HiveSyncMode;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
 import org.apache.hudi.keygen.constant.KeyGeneratorType;
+import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
 import org.apache.hudi.util.ClientIds;
@@ -613,6 +614,16 @@ public class FlinkOptions extends HoodieConfig {
       .defaultValue(128)
       .withDescription("Sort memory in MB, default 128MB");

+  @AdvancedConfig
+  public static final ConfigOption<String> WRITE_PARTITION_OVERWRITE_MODE = ConfigOptions
+      .key("write.partition.overwrite.mode")
+      .stringType()
+      .defaultValue(PartitionOverwriteMode.STATIC.name())
+      .withDescription("When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. "
+          + "Static mode deletes all the partitions that match the partition specification(e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. "
+          + "Dynamic mode doesn't delete partitions ahead, and only overwrite those partitions that have data written into it at runtime. "
+          + "By default we use static mode to keep the same behavior of previous version.");
+
   // this is only for internal use
   @AdvancedConfig
   public static final ConfigOption<String> WRITE_CLIENT_ID = ConfigOptions

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
index 944e795dc2f..bfde0b0e2b0 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
@@ -35,6 +35,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.format.FilePathUtils;
 import org.apache.flink.configuration.ConfigOption;
@@ -241,6 +242,14 @@ public class OptionsResolver {
         || conf.getString(FlinkOptions.OPERATION).equalsIgnoreCase(WriteOperationType.INSERT_OVERWRITE.value());
   }

+  /**
+   * Returns whether the operation is INSERT OVERWRITE dynamic partition.
+   */
+  public static boolean overwriteDynamicPartition(Configuration conf) {
+    return conf.getString(FlinkOptions.OPERATION).equalsIgnoreCase(WriteOperationType.INSERT_OVERWRITE.value())
+        || conf.getString(FlinkOptions.WRITE_PARTITION_OVERWRITE_MODE).equalsIgnoreCase(PartitionOverwriteMode.DYNAMIC.name());
+  }
+
   /**
    * Returns whether the read start commit is specific commit timestamp.
    */

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java
new file mode 100644
index 000..700c7432e25
--- /dev/null
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/overwrite/PartitionOverwriteMode.java
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) unde
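The new `write.partition.overwrite.mode` option distinguishes two INSERT OVERWRITE behaviors. The difference can be sketched with a toy model (illustration only, not Hudi code — the map-based "table" and the `insertOverwrite` helper are invented for this example): static mode first deletes every partition matched by the partition spec, while dynamic mode replaces only partitions that actually receive data at runtime.

```java
import java.util.*;

// Toy model of static vs dynamic partition overwrite: a "table" is just a
// map of partition -> rows. Not Hudi's implementation; for illustration only.
public class PartitionOverwriteModes {

  // STATIC: partitions matching the spec are dropped before writing.
  // DYNAMIC: only partitions that actually receive new data are replaced.
  static Map<String, List<String>> insertOverwrite(Map<String, List<String>> table,
                                                   Map<String, List<String>> incoming,
                                                   Set<String> staticPartitionSpec,
                                                   boolean dynamic) {
    Map<String, List<String>> result = new HashMap<>(table);
    if (!dynamic) {
      result.keySet().removeAll(staticPartitionSpec); // delete matched partitions up front
    }
    result.putAll(incoming); // overwrite partitions that have incoming data
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> table = new HashMap<>();
    table.put("dt=2023-08-05", Collections.singletonList("old-a"));
    table.put("dt=2023-08-06", Collections.singletonList("old-b"));
    Map<String, List<String>> incoming =
        Collections.singletonMap("dt=2023-08-05", Collections.singletonList("new-a"));

    // Static mode with a spec covering both partitions: dt=2023-08-06 is
    // deleted even though no new data lands in it.
    Map<String, List<String>> staticResult =
        insertOverwrite(table, incoming, new HashSet<>(table.keySet()), false);
    System.out.println(staticResult.containsKey("dt=2023-08-06")); // false

    // Dynamic mode: dt=2023-08-06 is left untouched.
    Map<String, List<String>> dynamicResult =
        insertOverwrite(table, incoming, Collections.emptySet(), true);
    System.out.println(dynamicResult.get("dt=2023-08-06")); // [old-b]
  }
}
```

This also shows why static is the safer default: it preserves the pre-existing behavior where the partition spec alone decides what gets cleared.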
[GitHub] [hudi] danny0405 merged pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table
danny0405 merged PR #9287:
URL: https://github.com/apache/hudi/pull/9287

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (HUDI-6654) Encode record positions in log blocks
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Description: To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming.

> Encode record positions in log blocks
> -------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
>
> To support position-based merging of base and log files, we need to encode
> positions in the log blocks so that the positions can be used directly,
> without having to deserialize records or delete keys for OverwriteWithLatest
> payload, or with ordering values required only for
> `DefaultHoodieRecordPayload` supporting event time based streaming.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
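The idea in the issue description can be sketched as follows (a hypothetical illustration, not Hudi's reader code — `mergeByPosition` and its argument shapes are invented here): if each log block carries `(position, record)` updates plus a set of deleted positions, merging against the base file becomes a direct index patch over the base rows, with no key extraction, no record deserialization for lookup, and no hash join.

```java
import java.util.*;

// Hypothetical sketch of position-based merging: log entries address
// base-file rows by ordinal position, so the merge is an index patch
// rather than a key-based join.
public class PositionMergeSketch {

  static List<String> mergeByPosition(List<String> baseRows,
                                      Map<Integer, String> updatesByPosition,
                                      Set<Integer> deletedPositions) {
    List<String> merged = new ArrayList<>();
    for (int pos = 0; pos < baseRows.size(); pos++) {
      if (deletedPositions.contains(pos)) {
        continue; // row deleted by a log entry
      }
      // Take the log update for this position if present, else the base row.
      merged.add(updatesByPosition.getOrDefault(pos, baseRows.get(pos)));
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> base = Arrays.asList("r0", "r1", "r2", "r3");
    Map<Integer, String> updates = Collections.singletonMap(1, "r1'");
    Set<Integer> deletes = Collections.singleton(3);
    System.out.println(mergeByPosition(base, updates, deletes)); // [r0, r1', r2]
  }
}
```

The payload-specific caveats in the description (ordering values for `DefaultHoodieRecordPayload`) are outside this sketch: there, the winning record still depends on an ordering field, not only on position.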
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-142088

## CI report:

* edd7d00172e57421cc50dcfcc2c90e4b909f67bd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19116)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[jira] [Assigned] (HUDI-6654) Encode record positions in log blocks
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-6654:
-------------------------------
    Assignee: Ethan Guo

> Encode record positions in log blocks
> -------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
[jira] [Assigned] (HUDI-6653) Support position-based merging of base and log files
[ https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-6653:
-------------------------------
    Assignee: Ethan Guo

> Support position-based merging of base and log files
> ----------------------------------------------------
>
>                 Key: HUDI-6653
>                 URL: https://issues.apache.org/jira/browse/HUDI-6653
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
[jira] [Created] (HUDI-6654) Encode record positions in log blocks
Ethan Guo created HUDI-6654:
-------------------------------
             Summary: Encode record positions in log blocks
                 Key: HUDI-6654
                 URL: https://issues.apache.org/jira/browse/HUDI-6654
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: Ethan Guo
[jira] [Updated] (HUDI-6654) Encode record positions in log blocks
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Fix Version/s: 0.14.0

> Encode record positions in log blocks
> -------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
[jira] [Updated] (HUDI-6654) Encode record positions in log blocks
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Priority: Blocker  (was: Major)

> Encode record positions in log blocks
> -------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.14.0
[jira] [Created] (HUDI-6653) Support position-based merging of base and log files
Ethan Guo created HUDI-6653:
-------------------------------
             Summary: Support position-based merging of base and log files
                 Key: HUDI-6653
                 URL: https://issues.apache.org/jira/browse/HUDI-6653
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: Ethan Guo
[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files
hudi-bot commented on PR #9367:
URL: https://github.com/apache/hudi/pull/9367#issuecomment-132116

## CI report:

* 3da7a8536e1c4bb4a7450d59b0ad32e9ed048c20 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19101)
* cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19118)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
hudi-bot commented on PR #9371:
URL: https://github.com/apache/hudi/pull/9371#issuecomment-132137

## CI report:

* 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN
* 33bbbdb4142bf4eeae9696b3f3d1e795e6970ae3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19103)
* 5bc292aa241347e9a985e70e1335a9d7309d51aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19117)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
hudi-bot commented on PR #9371:
URL: https://github.com/apache/hudi/pull/9371#issuecomment-130313

## CI report:

* 0e39684d85c18aef48131ba838ffc63e48b5fcf2 UNKNOWN
* 33bbbdb4142bf4eeae9696b3f3d1e795e6970ae3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19103)
* 5bc292aa241347e9a985e70e1335a9d7309d51aa UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9367: [HUDI-6648] Support building table views from existing files
hudi-bot commented on PR #9367:
URL: https://github.com/apache/hudi/pull/9367#issuecomment-130298

## CI report:

* 3da7a8536e1c4bb4a7450d59b0ad32e9ed048c20 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19101)
* cfb02dbb0dd38da5a1e375ceb3436baacc53ad49 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
the-other-tim-brown commented on code in PR #9374:
URL: https://github.com/apache/hudi/pull/9374#discussion_r1285129506

## hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java: ##

@@ -33,36 +32,56 @@
 import java.util.Map;
 import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
+import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;

 /**
  * Helper class for syncing Hudi commit data with external metastores.
  */
 public class SyncUtilHelpers {
   private static final Logger LOG = LoggerFactory.getLogger(SyncUtilHelpers.class);
+
+  // Locks for each table (base path) to avoid concurrent modification of the same underneath meta storage.
+  // Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
+  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();
+
   /**
    * Create an instance of an implementation of {@link HoodieSyncTool} that will sync all the relevant meta information
    * with an external metastore such as Hive etc. to ensure Hoodie tables can be queried or read via external systems.
    *
-   * IMPORTANT: make this method class level thread safe to avoid concurrent modification of the same underneath meta storage.
-   * Meta store such as Hive may encounter {@code ConcurrentModificationException} for #alter_table.
-   *
-   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
-   * @param props property map.
-   * @param hadoopConfig Hadoop confs.
-   * @param fs Filesystem used.
-   * @param targetBasePath The target base path that contains the hoodie table.
-   * @param baseFileFormat The file format used by the hoodie table (defaults to PARQUET).
+   * @param syncToolClassName Class name of the {@link HoodieSyncTool} implementation.
+   * @param props             property map.
+   * @param hadoopConfig      Hadoop confs.
+   * @param fs                Filesystem used.
+   * @param targetBasePath    The target base path that contains the hoodie table.
+   * @param baseFileFormat    The file format used by the hoodie table (defaults to PARQUET).
    */
-  public static synchronized void runHoodieMetaSync(String syncToolClassName,
+  public static void runHoodieMetaSync(String syncToolClassName,
                                        TypedProperties props,
                                        Configuration hadoopConfig,
                                        FileSystem fs,
                                        String targetBasePath,
                                        String baseFileFormat) {
-    try (HoodieSyncTool syncTool = instantiateMetaSyncTool(syncToolClassName, props, hadoopConfig, fs, targetBasePath, baseFileFormat)) {
-      syncTool.syncHoodieTable();
-    } catch (Throwable e) {
-      throw new HoodieMetaSyncException("Could not sync using the meta sync class " + syncToolClassName, e);
+    if (targetBasePath == null) {
+      throw new IllegalArgumentException("Target base path must not be null");
+    }
+
+    // Get or create a lock for the specific table
+    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());

Review Comment:
   What are your thoughts on keying off of the table base path and the syncToolClassName to allow concurrent updates to multiple meta syncs?
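The pattern under review — one lazily created lock per table base path, replacing the old class-level `synchronized` — can be sketched in isolation like this (the `runWithTableLock` helper is invented for the example; the real method additionally instantiates and closes the `HoodieSyncTool` inside the locked region):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of per-base-path locking: syncs against different tables proceed in
// parallel, while syncs against the same table serialize on one ReentrantLock.
public class PerTableLockSketch {
  private static final ConcurrentHashMap<String, Lock> TABLE_LOCKS = new ConcurrentHashMap<>();

  public static void runWithTableLock(String targetBasePath, Runnable sync) {
    if (targetBasePath == null) {
      throw new IllegalArgumentException("Target base path must not be null");
    }
    // computeIfAbsent is atomic, so two threads racing on the same
    // path always observe the same Lock instance.
    Lock tableLock = TABLE_LOCKS.computeIfAbsent(targetBasePath, k -> new ReentrantLock());
    tableLock.lock();
    try {
      sync.run();
    } finally {
      tableLock.unlock();
    }
  }
}
```

The reviewer's suggestion of keying on both the base path and the sync tool class would, in this sketch, amount to using a composite map key (e.g. `targetBasePath + "_" + syncToolClassName`) in the `computeIfAbsent` call, letting different metastore sync tools for the same table run concurrently.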
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
the-other-tim-brown commented on code in PR #9367: URL: https://github.com/apache/hudi/pull/9367#discussion_r1285129324 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.functional; + +import org.apache.hudi.avro.model.HoodieActionInstant; +import org.apache.hudi.avro.model.HoodieCleanFileInfo; +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.avro.model.HoodieCleanerPlan; +import org.apache.hudi.avro.model.HoodieMetadataColumnStats; +import org.apache.hudi.avro.model.IntWrapper; +import org.apache.hudi.avro.model.StringWrapper; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.HoodieCleanStat; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieCleaningPolicy; +import org.apache.hudi.common.model.HoodieColumnRangeMetadata; +import org.apache.hudi.common.model.HoodieDeltaWriteStat; +import org.apache.hudi.common.model.HoodieFileGroup; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.CleanerUtils; +import org.apache.hudi.common.util.ExternalFilePathUtil; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.config.HoodieArchivalConfig; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.metadata.HoodieBackedTableMetadata; +import org.apache.hudi.metadata.HoodieMetadataFileSystemView; +import org.apache.hudi.metadata.HoodieTableMetadataWriter; +import org.apache.hudi.table.action.clean.CleanPlanner; +import org.apache.hudi.testutils.HoodieClientTestBase; + +import 
org.apache.spark.api.java.JavaRDD; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.Arguments; +import org.junit.jupiter.params.provider.MethodSource; + +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY; + +/** + * Asserts that tables initialized from file paths created outside Hudi can properly be loaded. + */ +public class TestExternalPathHandling extends HoodieClientTestBase { + + private static final String FIELD_1 = "field1"; + private static final String FIELD_2 = "field2"; + private HoodieWriteConfig writeConfig; + + @ParameterizedTest + @MethodSource("getArgs") + public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List partitions) throws Exception { +metaClient = HoodieTableMetaClient.reload(metaClient); +writeConfig = HoodieWriteConfig.newBuilder() + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build()) +.withPath(metaClient.getBasePathV2().toString()) +.withEmbeddedTimelineServerEnabled(false) +.withMetadataConfig(HoodieMetadataConfig.newBuilder() +.withMaxNumDeltaCommitsBeforeCompaction(2) +.withFileSystemBootstrapDisabled(true) +.enable(true) +.withMetadataIndexColumnStats(true) +.withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2) +.build()) + .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build()) +.withTableServicesEnabled(true) +.build(); + +writeClient = getHoodieWriteClient(writeConfig); +String instantTime1 = writeClient.startCommit(HoodieT
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285129179

## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java: ##

@@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
     super(filePath);
     this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
-    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+    String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName(getFileName());
     this.fileId = fileIdAndCommitTime[0];
     this.commitTime = fileIdAndCommitTime[1];
   }

+  public HoodieBaseFile(String filePath, String fileId, String commitTime, BaseFile bootstrapBaseFile) {
+    super(filePath);
+    this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+    this.fileId = fileId;
+    this.commitTime = commitTime;
+  }
+
+  private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, BaseFile bootstrapBaseFile) {
+    this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], bootstrapBaseFile);
+  }
+
+  public HoodieBaseFile(FileStatus fileStatus, String fileId, String commitTime, BaseFile bootstrapBaseFile) {
+    super(handleExternallyGeneratedFileName(fileStatus, fileId));

Review Comment:
   Ok makes sense, updating the naming and adding a java doc for more clarity for future developers
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128589

## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java: ##

@@ -312,6 +312,13 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .withDocumentation("Maximum size in bytes of a single log file. Larger log files can contain larger log blocks "
           + "thereby reducing the number of blocks to search for keys");

+  public static final ConfigProperty DISABLE_FILESYSTEM_BOOTSTRAP = ConfigProperty
+      .key(METADATA_PREFIX + ".filesystem.bootstrap.disabled")
+      .defaultValue(false)
+      .sinceVersion("0.14.0")
+      .withDocumentation("Disable bootstrapping metadata table from the file system when the table is first created. "

Review Comment:
   Any more thoughts here? I'm not sure about naming conventions
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
the-other-tim-brown commented on code in PR #9367:
URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128491

## hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java: ##

@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/**
+ * Utility methods for handling externally created files.
+ */
+public class ExternalFilePathUtil {
+  // Suffix acts as a marker when appended to a file path that the path was created by an external system and not a Hudi writer.
+  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

Review Comment:
   For ease of appending this to the path in the write stats that are passed in the commit. Now we go from something like `partition1/file_1.parquet` to `partition1/file_1.parquet_20230805150812_hudiext` instead of `partition1/hudi_ext_20230805150812_file_1.parquet`.
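Based on the example paths in the comment above, the marker scheme can be sketched as follows (the helper method names here are assumptions for illustration; the utility's actual API surface may differ): the commit instant and the `_hudiext` suffix are appended to the original path, so Hudi can both recognize a file it did not write and recover the commit time it was added under.

```java
// Sketch of the external-file naming marker discussed in the review thread.
// Helper names are hypothetical; only the suffix and path shape come from the thread.
public class ExternalFileMarkerSketch {
  // Marker distinguishing files created by external systems from Hudi-written files.
  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

  static String appendCommitTimeAndMarker(String filePath, String commitTime) {
    // partition1/file_1.parquet -> partition1/file_1.parquet_20230805150812_hudiext
    return filePath + "_" + commitTime + EXTERNAL_FILE_SUFFIX;
  }

  static boolean isExternallyCreatedFile(String fileName) {
    return fileName.endsWith(EXTERNAL_FILE_SUFFIX);
  }

  public static void main(String[] args) {
    String marked = appendCommitTimeAndMarker("partition1/file_1.parquet", "20230805150812");
    System.out.println(marked); // partition1/file_1.parquet_20230805150812_hudiext
    System.out.println(isExternallyCreatedFile(marked)); // true
  }
}
```

Suffixing (rather than prefixing, as in the rejected `hudi_ext_..._file_1.parquet` form) keeps the transformation a simple string append on the paths already present in the commit's write stats.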
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
the-other-tim-brown commented on code in PR #9367: URL: https://github.com/apache/hudi/pull/9367#discussion_r1285128263 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.functional; + +import org.apache.hudi.avro.model.HoodieActionInstant; +import org.apache.hudi.avro.model.HoodieCleanFileInfo; +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.avro.model.HoodieCleanerPlan; +import org.apache.hudi.avro.model.HoodieMetadataColumnStats; +import org.apache.hudi.avro.model.IntWrapper; +import org.apache.hudi.avro.model.StringWrapper; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.HoodieCleanStat; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieCleaningPolicy; +import org.apache.hudi.common.model.HoodieColumnRangeMetadata; +import org.apache.hudi.common.model.HoodieDeltaWriteStat; +import org.apache.hudi.common.model.HoodieFileGroup; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.CleanerUtils; +import org.apache.hudi.common.util.ExternalFilePathUtil; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.config.HoodieArchivalConfig; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.metadata.HoodieBackedTableMetadata; +import org.apache.hudi.metadata.HoodieMetadataFileSystemView; +import org.apache.hudi.metadata.HoodieTableMetadataWriter; +import org.apache.hudi.table.action.clean.CleanPlanner; +import org.apache.hudi.testutils.HoodieClientTestBase; + +import 
org.apache.spark.api.java.JavaRDD; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.Arguments; +import org.junit.jupiter.params.provider.MethodSource; + +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY; + +/** + * Asserts that tables initialized from file paths created outside Hudi can properly be loaded. + */ +public class TestExternalPathHandling extends HoodieClientTestBase { + + private static final String FIELD_1 = "field1"; + private static final String FIELD_2 = "field2"; + private HoodieWriteConfig writeConfig; + + @ParameterizedTest + @MethodSource("getArgs") + public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List partitions) throws Exception { +metaClient = HoodieTableMetaClient.reload(metaClient); +writeConfig = HoodieWriteConfig.newBuilder() + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build()) +.withPath(metaClient.getBasePathV2().toString()) +.withEmbeddedTimelineServerEnabled(false) +.withMetadataConfig(HoodieMetadataConfig.newBuilder() +.withMaxNumDeltaCommitsBeforeCompaction(2) +.withFileSystemBootstrapDisabled(true) +.enable(true) +.withMetadataIndexColumnStats(true) +.withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2) +.build()) + .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build()) +.withTableServicesEnabled(true) +.build(); + +writeClient = getHoodieWriteClient(writeConfig); +String instantTime1 = writeClient.startCommit(HoodieT
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285128085 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -285,7 +373,29 @@ public HoodieWriteMetadata logCompact(String logCompactionInstantTime) { * Commit Log Compaction and track metrics. */ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable table, String logCompactionCommitTime) { -throw new UnsupportedOperationException("Log compaction is not supported yet."); +this.context.setJobStatus(this.getClass().getSimpleName(), "Collect log compaction write status and commit compaction"); +List writeStats = metadata.getWriteStats(); +final HoodieInstant logCompactionInstant = new HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.LOG_COMPACTION_ACTION, logCompactionCommitTime); Review Comment: Added ## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkTableServiceClient.java: ## @@ -243,7 +169,31 @@ public HoodieWriteMetadata> cluster(String clusteringInstant, } @Override - protected HoodieTable createTable(HoodieWriteConfig config, Configuration hadoopConf) { + protected void validateClusteringCommit(HoodieWriteMetadata> clusteringMetadata, String clusteringCommitTime, HoodieTable table) { Review Comment: Updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285128066 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -75,10 +83,13 @@ import java.util.stream.Collectors; import java.util.stream.Stream; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN; import static org.apache.hudi.common.util.ValidationUtils.checkArgument; import static org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit; -public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient implements RunsTableService { +public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient implements RunsTableService { Review Comment: Added
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285127678 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -974,4 +1159,20 @@ public void close() { // Stop timeline-server if running super.close(); } + + protected void updateTableMetadata(HoodieTable table, HoodieCommitMetadata commitMetadata, + HoodieInstant hoodieInstant, + HoodieData writeStatuses) { +// Do not do any conflict resolution here as we do with regular writes. We take the lock here to ensure all writes to metadata table happens within a +// single lock (single writer). Because more than one write to metadata table will result in conflicts since all of them updates the same partition. +table.getMetadataWriter(hoodieInstant.getTimestamp()) +.ifPresent(writer -> ((HoodieTableMetadataWriter) writer).update(commitMetadata, writeStatuses, hoodieInstant.getTimestamp())); + } + + protected void handleWriteErrors(List writeStats, TableServiceType tableServiceType) { Review Comment: Ok sounds good
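The `handleWriteErrors` helper introduced in the diff above takes the per-file write stats for a table service and fails the operation if any errors were recorded. A minimal, self-contained sketch of that error-check pattern (using a hypothetical stand-in `WriteStat` class rather than Hudi's `HoodieWriteStat`, and a plain string instead of `TableServiceType`) might look like:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified sketch of the handleWriteErrors pattern discussed
// above. WriteStat is a stand-in, not Hudi's actual class.
public class WriteErrorCheckSketch {

  // Stand-in for a per-file write stat; only the fields the check needs.
  public static class WriteStat {
    private final String path;
    private final long totalErrorRecords;

    public WriteStat(String path, long totalErrorRecords) {
      this.path = path;
      this.totalErrorRecords = totalErrorRecords;
    }

    public String getPath() {
      return path;
    }

    public long getTotalErrorRecords() {
      return totalErrorRecords;
    }
  }

  // Fails the table service if any file-level stat reported error records.
  public static void handleWriteErrors(List<WriteStat> writeStats, String tableServiceType) {
    long totalErrors = writeStats.stream().mapToLong(WriteStat::getTotalErrorRecords).sum();
    if (totalErrors > 0) {
      throw new IllegalStateException(
          tableServiceType + " failed: " + totalErrors + " error records across " + writeStats.size() + " files");
    }
  }

  public static void main(String[] args) {
    // A clean commit passes silently.
    handleWriteErrors(Arrays.asList(new WriteStat("file1.parquet", 0)), "COMPACT");
    System.out.println("no write errors");
  }
}
```

The real client would throw a Hudi-specific exception carrying the failed table service type; `IllegalStateException` here is purely illustrative.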
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285127206 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -244,12 +305,39 @@ public Option scheduleCompaction(Option> extraMetada * @param metadata All the metadata that gets stored along with a commit * @param extraMetadata Extra Metadata to be stored */ - public abstract void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option> extraMetadata); + public void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option> extraMetadata) { +extraMetadata.ifPresent(m -> m.forEach(metadata::addMetadata)); +completeCompaction(metadata, createTable(config, context.getHadoopConf().get()), compactionInstantTime); + } /** * Commit Compaction and track metrics. */ - protected abstract void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime); + protected void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime) { +this.context.setJobStatus(this.getClass().getSimpleName(), "Collect compaction write status and commit compaction: " + config.getTableName()); +List writeStats = metadata.getWriteStats(); +handleWriteErrors(writeStats, TableServiceType.COMPACT); +final HoodieInstant compactionInstant = HoodieTimeline.getCompactionInflightInstant(compactionCommitTime); +try { + this.txnManager.beginTransaction(Option.of(compactionInstant), Option.empty()); + finalizeWrite(table, compactionCommitTime, writeStats); + // commit to data table after committing to metadata table. + updateTableMetadata(table, metadata, compactionInstant, context.emptyHoodieData()); Review Comment: Do you know why the spark implementation isn't calling the `writeTableMetadata` which each client is required to implement? 
the implementations are very similar so we may be able to standardize
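The exchange above is about serializing metadata-table writes so that only a single writer updates it at a time. One common way to get that per-table serialization — a hypothetical sketch with illustrative names, not Hudi's actual `TransactionManager` or lock-provider machinery — is a registry holding one lock per table base path:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one lock per table base path, so concurrent writers
// against the same table are serialized while writers to different tables
// proceed in parallel. computeIfAbsent guarantees one Lock instance per key.
public class BasePathLockRegistry {
  private static final ConcurrentHashMap<String, Lock> LOCKS = new ConcurrentHashMap<>();

  // Always returns the same Lock instance for the same basePath.
  public static Lock lockFor(String basePath) {
    return LOCKS.computeIfAbsent(basePath, k -> new ReentrantLock());
  }

  // Runs the action while holding the basePath lock.
  public static void runWithLock(String basePath, Runnable action) {
    Lock lock = lockFor(basePath);
    lock.lock();
    try {
      action.run();
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    runWithLock("/data/table1", () -> System.out.println("serialized update for /data/table1"));
  }
}
```

The same pattern keys the lock on the table base path rather than a global lock, so syncs or metadata updates for unrelated tables in one process never block each other.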
[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
hudi-bot commented on PR #9345: URL: https://github.com/apache/hudi/pull/9345#issuecomment-117084 ## CI report: * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN * 5d45f2e15f3c08faba116d207c72d52f1c185321 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19115) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126911 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -174,11 +188,43 @@ private void inlineCompaction(HoodieTable table, Option> ext /** * Ensures compaction instant is in expected state and performs Log Compaction for the workload stored in instant-time.s * - * @param compactionInstantTime Compaction Instant Time + * @param logCompactionInstantTime Compaction Instant Time * @return Collection of Write Status */ - protected HoodieWriteMetadata logCompact(String compactionInstantTime, boolean shouldComplete) { -throw new UnsupportedOperationException("Log compaction is not supported yet."); + protected HoodieWriteMetadata logCompact(String logCompactionInstantTime, boolean shouldComplete) { Review Comment: this would be a breaking change in the event of someone else implementing custom logic in a write client outside of the hudi repo
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126721 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -146,7 +157,10 @@ protected void setPendingInflightAndRequestedInstants(Set pendingInfligh * @param metadata commit metadata for which pre commit is being invoked. */ protected void preCommit(HoodieCommitMetadata metadata) { -// To be overridden by specific engines to perform conflict resolution if any. +// Create a Hoodie table after startTxn which encapsulated the commits and files visible. Review Comment: I think it makes sense to leave it as protected so a future implementation can override it if they want
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
the-other-tim-brown commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285126662 ## hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/HoodieJavaTableServiceClient.java: ## @@ -42,27 +48,31 @@ protected HoodieJavaTableServiceClient(HoodieEngineContext context, } @Override - protected HoodieWriteMetadata> compact(String compactionInstantTime, boolean shouldComplete) { -throw new HoodieNotSupportedException("Compact is not supported in HoodieJavaTableServiceClient"); + protected void validateClusteringCommit(HoodieWriteMetadata> clusteringMetadata, String clusteringCommitTime, HoodieTable table) { Review Comment: Wouldn't validation be the same across clients though?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9371: [HUDI-6647] Expand Hudi Java Client Functionality
nsivabalan commented on code in PR #9371: URL: https://github.com/apache/hudi/pull/9371#discussion_r1285121130 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -75,10 +83,13 @@ import java.util.stream.Collectors; import java.util.stream.Stream; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN; import static org.apache.hudi.common.util.ValidationUtils.checkArgument; import static org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit; -public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient implements RunsTableService { +public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient implements RunsTableService { Review Comment: can we add java docs for the class, also calling out what I, T, O generally refers to ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -285,7 +373,29 @@ public HoodieWriteMetadata logCompact(String logCompactionInstantTime) { * Commit Log Compaction and track metrics. */ protected void completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable table, String logCompactionCommitTime) { -throw new UnsupportedOperationException("Log compaction is not supported yet."); +this.context.setJobStatus(this.getClass().getSimpleName(), "Collect log compaction write status and commit compaction"); +List writeStats = metadata.getWriteStats(); +final HoodieInstant logCompactionInstant = new HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.LOG_COMPACTION_ACTION, logCompactionCommitTime); Review Comment: handleWriteErrors(writeStats, TableServiceType.LOG_COMPACT); ? 
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -146,7 +157,10 @@ protected void setPendingInflightAndRequestedInstants(Set pendingInfligh * @param metadata commit metadata for which pre commit is being invoked. */ protected void preCommit(HoodieCommitMetadata metadata) { -// To be overridden by specific engines to perform conflict resolution if any. +// Create a Hoodie table after startTxn which encapsulated the commits and files visible. Review Comment: do we even need to have this protected now that no engine is overriding it ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -244,12 +305,39 @@ public Option scheduleCompaction(Option> extraMetada * @param metadata All the metadata that gets stored along with a commit * @param extraMetadata Extra Metadata to be stored */ - public abstract void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option> extraMetadata); + public void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata, Option> extraMetadata) { +extraMetadata.ifPresent(m -> m.forEach(metadata::addMetadata)); +completeCompaction(metadata, createTable(config, context.getHadoopConf().get()), compactionInstantTime); + } /** * Commit Compaction and track metrics. 
*/ - protected abstract void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime); + protected void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime) { +this.context.setJobStatus(this.getClass().getSimpleName(), "Collect compaction write status and commit compaction: " + config.getTableName()); +List writeStats = metadata.getWriteStats(); +handleWriteErrors(writeStats, TableServiceType.COMPACT); +final HoodieInstant compactionInstant = HoodieTimeline.getCompactionInflightInstant(compactionCommitTime); +try { + this.txnManager.beginTransaction(Option.of(compactionInstant), Option.empty()); + finalizeWrite(table, compactionCommitTime, writeStats); + // commit to data table after committing to metadata table. + updateTableMetadata(table, metadata, compactionInstant, context.emptyHoodieData()); Review Comment: again, in flink this was named as writeTableMetadata, while in spark its updateTableMetadata and may be there are reasons for it. can we ensure we don't change anything for flink. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java: ## @@ -974,4 +1159,20 @@ public void close() { // Stop timeline-server if running super.close(); } + + protected void updateTableMetadata(HoodieTable table, HoodieCommitMetadata commitMetadata, + Hood
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-102755 ## CI report: * c1977e21cdf02d75158bd3dec3b335e27755915c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19102) * edd7d00172e57421cc50dcfcc2c90e4b909f67bd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19116)
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-101264 ## CI report: * c1977e21cdf02d75158bd3dec3b335e27755915c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19102) * edd7d00172e57421cc50dcfcc2c90e4b909f67bd UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367: URL: https://github.com/apache/hudi/pull/9367#discussion_r1285120895 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestExternalPathHandling.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.functional; + +import org.apache.hudi.avro.model.HoodieActionInstant; +import org.apache.hudi.avro.model.HoodieCleanFileInfo; +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.avro.model.HoodieCleanerPlan; +import org.apache.hudi.avro.model.HoodieMetadataColumnStats; +import org.apache.hudi.avro.model.IntWrapper; +import org.apache.hudi.avro.model.StringWrapper; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.HoodieCleanStat; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieCleaningPolicy; +import org.apache.hudi.common.model.HoodieColumnRangeMetadata; +import org.apache.hudi.common.model.HoodieDeltaWriteStat; +import org.apache.hudi.common.model.HoodieFileGroup; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.CleanerUtils; +import org.apache.hudi.common.util.ExternalFilePathUtil; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.config.HoodieArchivalConfig; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.metadata.HoodieBackedTableMetadata; +import org.apache.hudi.metadata.HoodieMetadataFileSystemView; +import org.apache.hudi.metadata.HoodieTableMetadataWriter; +import org.apache.hudi.table.action.clean.CleanPlanner; +import org.apache.hudi.testutils.HoodieClientTestBase; + +import 
org.apache.spark.api.java.JavaRDD; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.Arguments; +import org.junit.jupiter.params.provider.MethodSource; + +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY; + +/** + * Asserts that tables initialized from file paths created outside Hudi can properly be loaded. + */ +public class TestExternalPathHandling extends HoodieClientTestBase { + + private static final String FIELD_1 = "field1"; + private static final String FIELD_2 = "field2"; + private HoodieWriteConfig writeConfig; + + @ParameterizedTest + @MethodSource("getArgs") + public void testFlow(FileIdAndNameGenerator fileIdAndNameGenerator, List partitions) throws Exception { +metaClient = HoodieTableMetaClient.reload(metaClient); +writeConfig = HoodieWriteConfig.newBuilder() + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(INMEMORY).build()) +.withPath(metaClient.getBasePathV2().toString()) +.withEmbeddedTimelineServerEnabled(false) +.withMetadataConfig(HoodieMetadataConfig.newBuilder() +.withMaxNumDeltaCommitsBeforeCompaction(2) +.withFileSystemBootstrapDisabled(true) +.enable(true) +.withMetadataIndexColumnStats(true) +.withColumnStatsIndexForColumns(FIELD_1 + "," + FIELD_2) +.build()) + .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(1, 2).build()) +.withTableServicesEnabled(true) +.build(); + +writeClient = getHoodieWriteClient(writeConfig); +String instantTime1 = writeClient.startCommit(HoodieTimeline.R
[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
hudi-bot commented on PR #9374: URL: https://github.com/apache/hudi/pull/9374#issuecomment-100068 ## CI report: * ee5d3013bf73645648b344a468404dd076e80003 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…
hudi-bot commented on PR #9226: URL: https://github.com/apache/hudi/pull/9226#issuecomment-1666599963 ## CI report: * 747f9d37083a87e614e8109bebf4cc33e7368350 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19112)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367: URL: https://github.com/apache/hudi/pull/9367#discussion_r1285119922 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -324,33 +325,45 @@ public static HoodieRecord createPartitionListRecord(List * @param partitionThe name of the partition * @param filesAdded Mapping of files to their sizes for files which have been added to this partition * @param filesDeleted List of files which have been deleted from this partition + * @param instantTime Commit time of the commit responsible for adding and/or deleting these files, will be empty during bootstrapping of the metadata table */ public static HoodieRecord createPartitionFilesRecord(String partition, - Option> filesAdded, - Option> filesDeleted) { -Map fileInfo = new HashMap<>(); -filesAdded.ifPresent(filesMap -> -fileInfo.putAll( -filesMap.entrySet().stream().collect( -Collectors.toMap(Map.Entry::getKey, (entry) -> { - long fileSize = entry.getValue(); - // Assert that the file-size of the file being added is positive, since Hudi - // should not be creating empty files - checkState(fileSize > 0); - return new HoodieMetadataFileInfo(fileSize, false); -}))) -); -filesDeleted.ifPresent(filesList -> -fileInfo.putAll( -filesList.stream().collect( -Collectors.toMap(Function.identity(), (ignored) -> new HoodieMetadataFileInfo(0L, true -); + Map filesAdded, + List filesDeleted, + Option instantTime) { +int size = filesAdded.size() + filesDeleted.size(); +Map fileInfo = new HashMap<>(size, 1); +filesAdded.forEach((fileName, fileSize) -> { + // Assert that the file-size of the file being added is positive, since Hudi + // should not be creating empty files + checkState(fileSize > 0); + fileInfo.put(handleFileName(fileName, instantTime), new HoodieMetadataFileInfo(fileSize, false)); +}); + +filesDeleted.forEach(fileName -> fileInfo.put(handleFileName(fileName, instantTime), DELETE_FILE_METADATA)); HoodieKey key = new HoodieKey(partition, 
MetadataPartitionType.FILES.getPartitionPath()); HoodieMetadataPayload payload = new HoodieMetadataPayload(key.getRecordKey(), METADATA_TYPE_FILE_LIST, fileInfo); return new HoodieAvroRecord<>(key, payload); } + /** + * In the case where a file was created by something other than a Hudi writer, the file name will not contain the commit time. We will prefix the file name with hudiext_[commitTime] before storing + * in the metadata table. The constructor for {@link org.apache.hudi.common.model.HoodieBaseFile} will properly handle this prefix. + * @param fileName incoming file name + * @param commitTime time of the commit (will be empty during bootstrap operations) + * @return file name with commit time prefix if the input file name does not contain the commit time, otherwise returns the original input + */ + private static String handleFileName(String fileName, Option commitTime) { +return commitTime.map(commit -> { + if (fileName.contains(commit) || FSUtils.isLogFile(fileName)) { Review Comment: sounds good.
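The `handleFileName` logic quoted above tags externally created file names with the commit time plus a marker, so the commit that added a file written outside Hudi can be recovered later. A simplified sketch of that tagging — the names and exact format here are illustrative assumptions, not the actual `ExternalFilePathUtil` implementation:

```java
// Hypothetical sketch of marking externally created file names with the
// commit time and a marker, per the review discussion above. The format
// "name_commitTime_hudiext" is an illustrative assumption.
public class ExternalFileNameSketch {
  // Marker telling readers the file was created outside of a Hudi writer.
  private static final String EXTERNAL_FILE_SUFFIX = "_hudiext";

  // Tags an externally created file name with the commit time and marker,
  // so the commit that added the file can be recovered from the name alone.
  public static String appendCommitTimeAndMarker(String fileName, String commitTime) {
    return fileName + "_" + commitTime + EXTERNAL_FILE_SUFFIX;
  }

  // True if the name carries the external-file marker.
  public static boolean isExternallyCreated(String fileName) {
    return fileName.endsWith(EXTERNAL_FILE_SUFFIX);
  }

  public static void main(String[] args) {
    String tagged = appendCommitTimeAndMarker("file1.parquet", "20230806120000");
    System.out.println(tagged); // file1.parquet_20230806120000_hudiext
  }
}
```

Names written by a Hudi writer already embed the commit time, so only names lacking it would be tagged; a file-listing reader can then branch on `isExternallyCreated` to decide how to parse out the file id and commit time.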
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9367: [HUDI-6648] Support building table views from existing files
nsivabalan commented on code in PR #9367: URL: https://github.com/apache/hudi/pull/9367#discussion_r1285119538 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java: ## @@ -61,16 +62,39 @@ public HoodieBaseFile(String filePath) { public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) { super(filePath); this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile); -String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName(); +String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName(getFileName()); this.fileId = fileIdAndCommitTime[0]; this.commitTime = fileIdAndCommitTime[1]; } + public HoodieBaseFile(String filePath, String fileId, String commitTime, BaseFile bootstrapBaseFile) { +super(filePath); +this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile); +this.fileId = fileId; +this.commitTime = commitTime; + } + + private HoodieBaseFile(FileStatus fileStatus, String[] fileIdAndCommitTime, BaseFile bootstrapBaseFile) { +this(fileStatus, fileIdAndCommitTime[0], fileIdAndCommitTime[1], bootstrapBaseFile); + } + + public HoodieBaseFile(FileStatus fileStatus, String fileId, String commitTime, BaseFile bootstrapBaseFile) { +super(handleExternallyGeneratedFileName(fileStatus, fileId)); Review Comment: the naming kind of threw me off. can we name it `mayBeHandleExternallyGeneratedFiles` instead of `handleExternallyGeneratedFileName` ## hudi-common/src/main/java/org/apache/hudi/common/util/ExternalFilePathUtil.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util; + +/** + * Utility methods for handling externally created files. + */ +public class ExternalFilePathUtil { + // Suffix acts as a marker when appended to a file path that the path was created by an external system and not a Hudi writer. + private static final String EXTERNAL_FILE_SUFFIX = "_hudiext"; Review Comment: oh, we are changing it to suffix is it. whats the rationale? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -324,33 +325,45 @@ public static HoodieRecord createPartitionListRecord(List * @param partitionThe name of the partition * @param filesAdded Mapping of files to their sizes for files which have been added to this partition * @param filesDeleted List of files which have been deleted from this partition + * @param instantTime Commit time of the commit responsible for adding and/or deleting these files, will be empty during bootstrapping of the metadata table */ public static HoodieRecord createPartitionFilesRecord(String partition, - Option> filesAdded, - Option> filesDeleted) { -Map fileInfo = new HashMap<>(); -filesAdded.ifPresent(filesMap -> -fileInfo.putAll( -filesMap.entrySet().stream().collect( -Collectors.toMap(Map.Entry::getKey, (entry) -> { - long fileSize = entry.getValue(); - // Assert that the file-size of the file being added is positive, since Hudi - // should not be creating empty files - checkState(fileSize > 0); - return new HoodieMetadataFileInfo(fileSize, false); -}))) -); -filesDeleted.ifPresent(filesList -> -fileInfo.putAll( 
-filesList.stream().collect( -Collectors.toMap(Function.identity(), (ignored) -> new HoodieMetadataFileInfo(0L, true -); + Map filesAdded, + List filesDeleted, + Option instantTime) { +int size = filesAdded.size() + filesDeleted.size(); +Map fileInfo = new HashMap<>(size, 1); +filesAdded.forEach((fileName, file
[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
hudi-bot commented on PR #9345: URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666588552 ## CI report: * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106) * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN * 5d45f2e15f3c08faba116d207c72d52f1c185321 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19115)
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1666588507 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * f13bb9ce0b2ba2d264ed8cf639a70bd45ee2d578 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19110)
[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
hudi-bot commented on PR #9345: URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666579827 ## CI report: * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106) * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN * 5d45f2e15f3c08faba116d207c72d52f1c185321 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
hudi-bot commented on PR #9345: URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666578509 ## CI report: * dbe51932f391f3acbb1a4bfc3bfef2fb7ec62660 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19106) * 2874d5bf688f588dde8f31d8b266c0a60c3f11bb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9317: [MINOR] Simplify CreateHoodieTableCommand logWarning
hudi-bot commented on PR #9317: URL: https://github.com/apache/hudi/pull/9317#issuecomment-1666578493 ## CI report: * 7ecfd3c7230df1aea3db5a1ffd4463d1b25f16d7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18944) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19111)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
nsivabalan commented on code in PR #9345: URL: https://github.com/apache/hudi/pull/9345#discussion_r1285089880 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -106,14 +109,30 @@ case class HoodieFileIndex(spark: SparkSession, * * @return List of FileStatus for base files */ - def allFiles: Seq[FileStatus] = { + def allBaseFiles: Seq[FileStatus] = { Review Comment: private? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -122,86 +141,132 @@ case class HoodieFileIndex(spark: SparkSession, * @return list of PartitionDirectory containing partition to base files mapping */ override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = { -// Look up candidate files names in the col-stats index, if all of the following conditions are true -//- Data-skipping is enabled -//- Col-Stats Index is present -//- List of predicates (filters) is present -val candidateFilesNamesOpt: Option[Set[String]] = +// Prune the partition path by the partition filters +// NOTE: Non-partitioned tables are assumed to consist from a single partition +// encompassing the whole table +val partitionsAndFileSlices = getFileSlicesForPrunedPartitions(partitionFilters) +val partitionsAndFilteredFileSlices = filterFileSlices(dataFilters, partitionsAndFileSlices).map { + case (partitionOpt, fileSlices) => +val allCandidateFiles: Seq[FileStatus] = fileSlices.flatMap(fs => { + val baseFileStatusOpt = getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null))) + val logFilesStatus = if (includeLogFiles) { + fs.getLogFiles.map[FileStatus](JFunction.toJavaFunction[HoodieLogFile, FileStatus](lf => lf.getFileStatus)) + } else { +java.util.stream.Stream.empty() + } + val files = logFilesStatus.collect(Collectors.toList[FileStatus]).asScala + baseFileStatusOpt.foreach(f => files.append(f)) + files +}) + 
+PartitionDirectory(InternalRow.fromSeq(partitionOpt.get.values), allCandidateFiles) +} + +hasPushedDownPartitionPredicates = true + +if (shouldReadAsPartitionedTable()) { + partitionsAndFilteredFileSlices +} else { + Seq(PartitionDirectory(InternalRow.empty, partitionsAndFilteredFileSlices.flatMap(_.files))) +} + } + + def filterFileSlices(dataFilters: Seq[Expression], partitionAndFileSlices: Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])]) + : Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])] = { +// If there are no data filters, return all the file slices. +// If there are no file slices, return empty list. +if (partitionAndFileSlices.isEmpty || dataFilters.isEmpty) { + partitionAndFileSlices +} else { + // Look up candidate files names in the col-stats index, if all of the following conditions are true + //- Data-skipping is enabled + //- Col-Stats Index is present + //- List of predicates (filters) is present + val candidateFilesNamesOpt: Option[Set[String]] = Review Comment: Let's file a follow-up ticket: it looks like we are not leveraging the pruned partition filters while polling col stats. Anyway, after we get the list from col stats we compare it against allFiles and trim it down properly, so it's not too bad, but the amount of data fetched from col stats could be trimmed. ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala: ## @@ -69,6 +71,8 @@ abstract class BaseMergeOnReadSnapshotRelation(sqlContext: SQLContext, prunedDataSchema: Option[StructType]) extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema, prunedDataSchema) { + fileIndex.setIncludeLogFiles(true) Review Comment: I feel this is not elegant: the setter is left open, and callers can invoke it at any time. Can we introduce a protected method called shouldIncludeLogFiles() in HoodieBaseRelation with a default value of false?
We can override it in MergeOnReadSnapshotRelation or BaseMergeOnReadSnapshotRelation to return true, and change the constructor of HoodieFileIndex to take an additional arg called shouldIncludeLogFiles, which callers are expected to set when invoking the constructor. ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -122,86 +141,132 @@ case class HoodieFileIndex(spark: SparkSession, * @return list of PartitionDirectory containing partition to base files mapping */ override def listFiles(partitionFilters: Seq[Expression
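The refactor suggested in this review can be sketched roughly as follows. The class shapes below are simplified, hypothetical stand-ins (only shouldIncludeLogFiles comes from the review; the real HoodieBaseRelation/HoodieFileIndex have many more members), so treat this as an illustration of the pattern rather than the actual implementation:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for HoodieFileIndex: the flag is fixed at construction
// time instead of being mutated later through an open setter.
class FileIndex {
  private final boolean includeLogFiles;
  FileIndex(boolean includeLogFiles) { this.includeLogFiles = includeLogFiles; }
  List<String> listedFileKinds() {
    return includeLogFiles ? Arrays.asList("base", "log") : Arrays.asList("base");
  }
}

// Simplified stand-in for HoodieBaseRelation.
abstract class BaseRelation {
  // Default: relations read only base files; subclasses opt in by overriding.
  protected boolean shouldIncludeLogFiles() { return false; }
  final FileIndex fileIndex = new FileIndex(shouldIncludeLogFiles());
}

class CowSnapshotRelation extends BaseRelation {}

// Stand-in for (Base)MergeOnReadSnapshotRelation, which needs log files.
class MergeOnReadSnapshotRelation extends BaseRelation {
  @Override protected boolean shouldIncludeLogFiles() { return true; }
}
```

With this shape, the decision is made once per relation type and cannot be flipped by an arbitrary caller after construction, which is the reviewer's objection to the public setter.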
[GitHub] [hudi] hudi-bot commented on pull request #9365: [HUDI-6646] Add default lock provider for spark offline compaction an…
hudi-bot commented on PR #9365: URL: https://github.com/apache/hudi/pull/9365#issuecomment-1666577154 ## CI report: * 320ff70e6f0d02b427d9dc868628934f0323fb4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19109)
[GitHub] [hudi] lokeshj1703 commented on pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
lokeshj1703 commented on PR #9345: URL: https://github.com/apache/hudi/pull/9345#issuecomment-1666574758 I have reverted the incremental relation changes; the test was passing. My bad.
[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
lokeshj1703 commented on code in PR #9345: URL: https://github.com/apache/hudi/pull/9345#discussion_r1285100568 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala: ## @@ -0,0 +1,379 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hudi.DataSourceWriteOptions.{DELETE_OPERATION_OPT_VAL, PRECOMBINE_FIELD, RECORDKEY_FIELD} +import org.apache.hudi.async.SparkAsyncCompactService +import org.apache.hudi.client.SparkRDDWriteClient +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.client.utils.MetadataConversionUtils +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.model.{FileSlice, HoodieCommitMetadata, HoodieTableType, WriteOperationType} +import org.apache.hudi.common.table.HoodieTableConfig +import org.apache.hudi.common.table.timeline.HoodieInstant +import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig, HoodieWriteConfig} +import org.apache.hudi.functional.ColumnStatIndexTestBase.ColumnStatsTestCase +import org.apache.hudi.index.HoodieIndex.IndexType.INMEMORY +import org.apache.hudi.metadata.HoodieMetadataFileSystemView +import org.apache.hudi.util.{JFunction, JavaConversions} +import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, HoodieFileIndex} +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, GreaterThan, Literal} +import org.apache.spark.sql.types.StringType +import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue} +import org.junit.jupiter.api.Disabled +import org.junit.jupiter.params.ParameterizedTest +import org.junit.jupiter.params.provider.MethodSource + +import java.util.Properties +import scala.collection.JavaConverters + +class TestColumnStatsIndexWithSQL extends ColumnStatIndexTestBase { + + @ParameterizedTest + @MethodSource(Array("testMetadataColumnStatsIndexParams")) + def testMetadataColumnStatsIndexWithSQL(testCase: ColumnStatsTestCase): Unit = { +val metadataOpts = Map( + HoodieMetadataConfig.ENABLE.key -> "true", + HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true" +) + +val commonOpts = Map( + 
"hoodie.insert.shuffle.parallelism" -> "4", + "hoodie.upsert.shuffle.parallelism" -> "4", + HoodieWriteConfig.TBL_NAME.key -> "hoodie_test", + DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString, + RECORDKEY_FIELD.key -> "c1", + PRECOMBINE_FIELD.key -> "c1", + HoodieTableConfig.POPULATE_META_FIELDS.key -> "true", + DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true", + DataSourceReadOptions.QUERY_TYPE.key -> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL +) ++ metadataOpts +setupTable(testCase, metadataOpts, commonOpts, shouldValidate = true) +verifyFileIndexAndSQLQueries(commonOpts) + } + + @ParameterizedTest + @MethodSource(Array("testMetadataColumnStatsIndexParamsForMOR")) + def testMetadataColumnStatsIndexSQLWithInMemoryIndex(testCase: ColumnStatsTestCase): Unit = { +val metadataOpts = Map( + HoodieMetadataConfig.ENABLE.key -> "true", + HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true" +) + +val commonOpts = Map( + "hoodie.insert.shuffle.parallelism" -> "4", + "hoodie.upsert.shuffle.parallelism" -> "4", + HoodieWriteConfig.TBL_NAME.key -> "hoodie_test", + DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString, + RECORDKEY_FIELD.key -> "c1", + PRECOMBINE_FIELD.key -> "c1", + HoodieTableConfig.POPULATE_META_FIELDS.key -> "true", + DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true", + DataSourceReadOptions.QUERY_TYPE.key -> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL, + HoodieIndexConfig.INDEX_TYPE.key() -> INMEMORY.name() +) ++ metadataOpts + +doWriteAndValidateColumnStats(testCase, metadataOpts, commonOpts, + dataSourcePath = "index/colstats/input-table-json", + expectedColStatsSourcePath = "index/colstats/column-stats-ind
[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index
hudi-bot commented on PR #9373: URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666562451 ## CI report: * 320d28961b879b719401a5a21ce4c54e1d0b1073 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19113)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9373: [HUDI-6651] Support IN SQL query with Record Index
nsivabalan commented on code in PR #9373: URL: https://github.com/apache/hudi/pull/9373#discussion_r1285082443 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ## @@ -343,7 +343,7 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext, */ override final def needConversion: Boolean = false - override def inputFiles: Array[String] = fileIndex.allFiles.map(_.getPath.toUri.toString).toArray + override def inputFiles: Array[String] = fileIndex.allBaseFiles.map(_.getPath.toUri.toString).toArray Review Comment: Are we sure about this change? Also, can we add Javadocs describing what the expected file list is here and how it is used?
[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
hudi-bot commented on PR #9374: URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666550244 ## CI report: * ee5d3013bf73645648b344a468404dd076e80003 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19114)
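PR #9374 in this thread implements basePath-level synchronization in runHoodieMetaSync, so concurrent meta syncs against the same table serialize while syncs to different tables proceed in parallel. The idea can be sketched as per-key locking with a ConcurrentHashMap of ReentrantLocks; this is a simplified sketch (the class and method names below are hypothetical, and the actual SyncUtilHelpers logic in the merged commit may differ):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper: one lock per basePath. computeIfAbsent guarantees a
// single Lock instance per key even under concurrent first access.
class BasePathLocks {
  private static final ConcurrentHashMap<String, Lock> LOCKS = new ConcurrentHashMap<>();

  static <T> T withLock(String basePath, Supplier<T> body) {
    Lock lock = LOCKS.computeIfAbsent(basePath, k -> new ReentrantLock());
    lock.lock();
    try {
      return body.get();   // e.g. instantiate and run the HoodieSyncTool here
    } finally {
      lock.unlock();       // always released, even if the sync throws
    }
  }
}
```

Keying the lock map by basePath (rather than a single global lock) keeps unrelated tables from blocking each other.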
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9373: [HUDI-6651] Support IN SQL query with Record Index
nsivabalan commented on code in PR #9373: URL: https://github.com/apache/hudi/pull/9373#discussion_r1285082443 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ## @@ -343,7 +343,7 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext, */ override final def needConversion: Boolean = false - override def inputFiles: Array[String] = fileIndex.allFiles.map(_.getPath.toUri.toString).toArray + override def inputFiles: Array[String] = fileIndex.allBaseFiles.map(_.getPath.toUri.toString).toArray Review Comment: Are we sure on this change? also, can we add java docs as to whats the expected file list here and how it is used. ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndexWithSQL.scala: ## @@ -45,12 +52,71 @@ class TestRecordLevelIndexWithSQL extends RecordLevelIndexTestBase { validate = false) createTempTable(hudiOpts) -val reckey = mergedDfList.last.limit(1).collect()(0).getAs("_row_key").toString -spark.sql("select * from " + sqlTempTable + " where '" + reckey + "' = _row_key").show(false) +testInQuery(hudiOpts) +testEqualToQuery(hudiOpts) + } + + def testEqualToQuery(hudiOpts: Map[String, String]): Unit = { +val reckey = mergedDfList.last.limit(1).collect().map(row => row.getAs("_row_key").toString) +val dataFilter = EqualTo(attribute("_row_key"), Literal(reckey(0))) +assertEquals(1, spark.sql("select * from " + sqlTempTable + " where " + dataFilter.sql).count()) +verifyPruningFileCount(hudiOpts, dataFilter, 1) + } + + def testInQuery(hudiOpts: Map[String, String]): Unit = { +var reckey = mergedDfList.last.limit(1).collect().map(row => row.getAs("_row_key").toString) +var dataFilter = In(attribute("_row_key"), reckey.map(l => literal(l)).toList) +assertEquals(1, spark.sql("select * from " + sqlTempTable + " where " + dataFilter.sql).count()) +var numFiles = if (isTableMOR()) 2 else 1 +verifyPruningFileCount(hudiOpts, dataFilter, numFiles) + +reckey = 
mergedDfList.last.limit(2).collect().map(row => row.getAs("_row_key").toString) +dataFilter = In(attribute("_row_key"), reckey.map(l => literal(l)).toList) +assertEquals(2, spark.sql("select * from " + sqlTempTable + " where " + dataFilter.sql).count()) +numFiles = if (isTableMOR()) 2 else 2 +verifyPruningFileCount(hudiOpts, dataFilter, numFiles) + } + + private def attribute(partition: String): AttributeReference = { +AttributeReference(partition, StringType, true)() + } + + private def literal(value: String): Literal = { +Literal.create(value) + } + + private def verifyPruningFileCount(opts: Map[String, String], dataFilter: Expression, numFiles: Int): Unit = { +metaClient = HoodieTableMetaClient.reload(metaClient) +val fileIndex = HoodieFileIndex(spark, metaClient, None, opts + ("path" -> basePath)) +fileIndex.setIncludeLogFiles(isTableMOR()) +val filteredPartitionDirectories = fileIndex.listFiles(Seq(), Seq(dataFilter)) +val filteredFilesCount = filteredPartitionDirectories.flatMap(s => s.files).size +assertTrue(filteredFilesCount < getLatestDataFilesCount(opts)) +assertEquals(filteredFilesCount, numFiles) + } + + private def isTableMOR(): Boolean = { +metaClient.getTableType == HoodieTableType.MERGE_ON_READ + } + + private def getLatestDataFilesCount(opts: Map[String, String], includeLogFiles: Boolean = true) = { +var totalLatestDataFiles = 0L + getTableFileSystenView(opts).getAllLatestFileSlicesBeforeOrOn(metaClient.getActiveTimeline.lastInstant().get().getTimestamp) + .values() + .forEach(JFunction.toJavaConsumer[java.util.stream.Stream[FileSlice]] +(slices => slices.forEach(JFunction.toJavaConsumer[FileSlice]( + slice => totalLatestDataFiles += (if (includeLogFiles) slice.getLogFiles.count() else 0) ++ (if (slice.getBaseFile.isPresent) 1 else 0) +totalLatestDataFiles + } + + private def getTableFileSystenView(opts: Map[String, String]): HoodieMetadataFileSystemView = { +new HoodieMetadataFileSystemView(metaClient, metaClient.getActiveTimeline, 
metadataWriter(getWriteConfig(opts)).getTableMetadata) } private def createTempTable(hudiOpts: Map[String, String]): Unit = { val readDf = spark.read.format("hudi").options(hudiOpts).load(basePath) +readDf.printSchema() Review Comment: What's the necessity for this? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala: ## @@ -123,28 +123,47 @@ class RecordLevelIndexSupport(spark: SparkSession, * @param queryFilters The queries that need to be filtered. * @return Tuple of List of filtered queries and list of record key literals that need to be matched */ - private def filterQueryFiltersWithRecordKey(queryFilters: Seq[Expression]): (L
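The filterQueryFiltersWithRecordKey change under review reduces EqualTo/In predicates on the record key column to a list of key literals that the record-level index can look up. The core idea can be sketched like this (a hedged illustration with simplified, hypothetical expression classes; the real code pattern-matches Spark Catalyst expressions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical mini expression tree standing in for Catalyst expressions.
abstract class Expr {}
class Attr extends Expr { final String name; Attr(String n) { name = n; } }
class Lit extends Expr { final String value; Lit(String v) { value = v; } }
class EqualTo extends Expr { final Expr left, right; EqualTo(Expr l, Expr r) { left = l; right = r; } }
class In extends Expr { final Expr value; final List<Expr> list; In(Expr v, List<Expr> l) { value = v; list = l; } }

class RecordKeyFilter {
  // Returns the key literals if (and only if) the filter is an EqualTo or In
  // on the record key column with literal operands; otherwise the record-level
  // index cannot serve the predicate.
  static Optional<List<String>> recordKeyLiterals(Expr filter, String recordKeyField) {
    if (filter instanceof EqualTo) {
      EqualTo eq = (EqualTo) filter;
      if (isKey(eq.left, recordKeyField) && eq.right instanceof Lit) {
        return Optional.of(Arrays.asList(((Lit) eq.right).value));
      }
      if (isKey(eq.right, recordKeyField) && eq.left instanceof Lit) {
        return Optional.of(Arrays.asList(((Lit) eq.left).value));
      }
    } else if (filter instanceof In && isKey(((In) filter).value, recordKeyField)) {
      List<String> keys = new ArrayList<>();
      for (Expr e : ((In) filter).list) {
        if (!(e instanceof Lit)) return Optional.empty();
        keys.add(((Lit) e).value);
      }
      return Optional.of(keys);
    }
    return Optional.empty();
  }

  private static boolean isKey(Expr e, String field) {
    return e instanceof Attr && ((Attr) e).name.equals(field);
  }
}
```

This mirrors why the test above expects an `In` filter with two key literals to prune down to the file group(s) holding those two keys.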
[GitHub] [hudi] hudi-bot commented on pull request #9374: [HUDI-6652] Implement basePath-level synchronization in runHoodieMetaSync
hudi-bot commented on PR #9374: URL: https://github.com/apache/hudi/pull/9374#issuecomment-1666540053 ## CI report: * ee5d3013bf73645648b344a468404dd076e80003 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index
hudi-bot commented on PR #9373: URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666540045 ## CI report: * 320d28961b879b719401a5a21ce4c54e1d0b1073 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19113)
[GitHub] [hudi] hudi-bot commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table
hudi-bot commented on PR #9287: URL: https://github.com/apache/hudi/pull/9287#issuecomment-1666540011 ## CI report: * cc2d8c36f8545608607cd3c4bcd481369d668d76 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19107)
[GitHub] [hudi] codope commented on a diff in pull request #9345: [HUDI-6649] Fix column stat based data filtering for MOR
codope commented on code in PR #9345: URL: https://github.com/apache/hudi/pull/9345#discussion_r1285078625 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/ColumnStatIndexTestBase.scala: ## @@ -0,0 +1,283 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hadoop.fs.{LocatedFileStatus, Path} +import org.apache.hudi.ColumnStatsIndexSupport.composeIndexSchema +import org.apache.hudi.HoodieConversionUtils.toProperties +import org.apache.hudi.common.config.{HoodieMetadataConfig, HoodieStorageConfig} +import org.apache.hudi.common.model.HoodieTableType +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.hudi.functional.ColumnStatIndexTestBase.ColumnStatsTestCase +import org.apache.hudi.testutils.HoodieSparkClientTestBase +import org.apache.hudi.{ColumnStatsIndexSupport, DataSourceWriteOptions} +import org.apache.spark.sql._ +import org.apache.spark.sql.functions.typedLit +import org.apache.spark.sql.types._ +import org.junit.jupiter.api.Assertions.assertEquals +import org.junit.jupiter.api._ +import org.junit.jupiter.params.provider.Arguments + +import java.math.BigInteger +import java.sql.{Date, Timestamp} +import scala.collection.JavaConverters._ +import scala.util.Random + +@Tag("functional") +class ColumnStatIndexTestBase extends HoodieSparkClientTestBase { + var spark: SparkSession = _ + var dfList: Seq[DataFrame] = Seq() + + val sourceTableSchema = +new StructType() + .add("c1", IntegerType) + .add("c2", StringType) + .add("c3", DecimalType(9, 3)) + .add("c4", TimestampType) + .add("c5", ShortType) + .add("c6", DateType) + .add("c7", BinaryType) + .add("c8", ByteType) + + @BeforeEach + override def setUp() { +initPath() +initSparkContexts() +initFileSystem() + +setTableName("hoodie_test") +initMetaClient() + +spark = sqlContext.sparkSession + } + + @AfterEach + override def tearDown() = { +cleanupFileSystem() +cleanupSparkContexts() + } + + protected def doWriteAndValidateColumnStats(testCase: ColumnStatsTestCase, +metadataOpts: Map[String, String], +hudiOpts: Map[String, String], +dataSourcePath: String, +expectedColStatsSourcePath: String, +operation: String, +saveMode: SaveMode, +shouldValidate: Boolean = true): Unit = { 
+val sourceJSONTablePath = getClass.getClassLoader.getResource(dataSourcePath).toString + +// NOTE: Schema here is provided for validation that the input date is in the appropriate format +val inputDF = spark.read.schema(sourceTableSchema).json(sourceJSONTablePath) + +inputDF + .sort("c1") + .repartition(4, new Column("c1")) + .write + .format("hudi") + .options(hudiOpts) + .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, 10 * 1024) + .option(DataSourceWriteOptions.OPERATION.key, operation) + .mode(saveMode) + .save(basePath) +dfList = dfList :+ inputDF + +metaClient = HoodieTableMetaClient.reload(metaClient) + +if (shouldValidate) { + // Currently, routine manually validating the column stats (by actually reading every column of every file) + // only supports parquet files. Therefore we skip such validation when delta-log files are present, and only + // validate in following cases: (1) COW: all operations; (2) MOR: insert only. + val shouldValidateColumnStatsManually = testCase.tableType == HoodieTableType.COPY_ON_WRITE || +operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) + + validateColumnStatsIndex( +testCase, metadataOpts, expectedColStatsSourcePath, shouldValidateColumnStatsManually) +} + } + + protected def buildColumnStatsTableManually(tablePath: String, +includedCols: Seq[String], +
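The code comment quoted above notes that column stats are "manually" validated by actually reading every column of every file and recomputing the stats. That recomputation idea can be sketched as follows (a hedged sketch with hypothetical names and in-memory values; Hudi reads the real values from parquet files):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical per-column stats record: min/max over non-null values, plus a
// null count — the quantities a column-stats index entry tracks per file.
class ColumnStats {
  final Integer min;   // null when the column has no non-null values
  final Integer max;
  final long nullCount;

  ColumnStats(Integer min, Integer max, long nullCount) {
    this.min = min; this.max = max; this.nullCount = nullCount;
  }

  // Recompute stats from the raw column values, the way a manual validator
  // would, so they can be compared against the index's stored entry.
  static ColumnStats of(List<Integer> column) {
    Integer min = null, max = null;
    long nulls = 0;
    for (Integer v : column) {
      if (v == null) { nulls++; continue; }
      if (min == null || v < min) min = v;
      if (max == null || v > max) max = v;
    }
    return new ColumnStats(min, max, nulls);
  }
}
```

This only supports base (parquet) files in the test above because recomputing stats for MOR delta-log files would require merging log records first.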
[GitHub] [hudi] hudi-bot commented on pull request #9373: [HUDI-6651] Support IN SQL query with Record Index
hudi-bot commented on PR #9373: URL: https://github.com/apache/hudi/pull/9373#issuecomment-1666538376 ## CI report: * 320d28961b879b719401a5a21ce4c54e1d0b1073 UNKNOWN