[jira] [Updated] (HUDI-2619) Make table services work with Dataset
[ https://issues.apache.org/jira/browse/HUDI-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2619:
-----------------------------
    Description: Clustering, Compaction, Clean should also work with Dataset

> Make table services work with Dataset
> -------------------------------------
>
>          Key: HUDI-2619
>          URL: https://issues.apache.org/jira/browse/HUDI-2619
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
>
> Clustering, Compaction, Clean should also work with Dataset

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (HUDI-2619) Make table services work with Dataset
Raymond Xu created HUDI-2619:
--------------------------------

         Summary: Make table services work with Dataset
             Key: HUDI-2619
             URL: https://issues.apache.org/jira/browse/HUDI-2619
         Project: Apache Hudi
      Issue Type: Sub-task
        Reporter: Raymond Xu
         Fix For: 0.10.0
[jira] [Updated] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient
[ https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2618:
-----------------------------
    Story Points: 3  (was: 4)

> Implement operations other than upsert in SparkDataFrameWriteClient
> -------------------------------------------------------------------
>
>          Key: HUDI-2618
>          URL: https://issues.apache.org/jira/browse/HUDI-2618
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
[jira] [Created] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient
Raymond Xu created HUDI-2618:
--------------------------------

         Summary: Implement operations other than upsert in SparkDataFrameWriteClient
             Key: HUDI-2618
             URL: https://issues.apache.org/jira/browse/HUDI-2618
         Project: Apache Hudi
      Issue Type: Sub-task
        Reporter: Raymond Xu
         Fix For: 0.10.0
[jira] [Updated] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient
[ https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2618:
-----------------------------
    Story Points: 4

> Implement operations other than upsert in SparkDataFrameWriteClient
> -------------------------------------------------------------------
>
>          Key: HUDI-2618
>          URL: https://issues.apache.org/jira/browse/HUDI-2618
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
[jira] [Updated] (HUDI-2617) Implement HBase Index for Dataset
[ https://issues.apache.org/jira/browse/HUDI-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2617:
-----------------------------
    Fix Version/s: 0.10.0

> Implement HBase Index for Dataset
> ---------------------------------
>
>          Key: HUDI-2617
>          URL: https://issues.apache.org/jira/browse/HUDI-2617
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
[ https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2615:
-----------------------------
    Fix Version/s: 0.10.0

> Decouple HoodieRecordPayload with Hoodie table, table services, and index
> -------------------------------------------------------------------------
>
>          Key: HUDI-2615
>          URL: https://issues.apache.org/jira/browse/HUDI-2615
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
>
> HoodieTable, HoodieIndex, and compaction, clustering services should be
> independent of HoodieRecordPayload
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2531:
-----------------------------
    Fix Version/s: 0.10.0

> [UMBRELLA] Support Dataset APIs in writer paths
> -----------------------------------------------
>
>          Key: HUDI-2531
>          URL: https://issues.apache.org/jira/browse/HUDI-2531
>      Project: Apache Hudi
>   Issue Type: New Feature
>   Components: Spark Integration
>     Reporter: Raymond Xu
>     Assignee: Raymond Xu
>     Priority: Blocker
>       Labels: hudi-umbrellas
>      Fix For: 0.10.0
>
> To make use of Dataset APIs in writer paths instead of RDD.
[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset
[ https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2616:
-----------------------------
    Fix Version/s: 0.10.0

> Implement BloomIndex for Dataset
> --------------------------------
>
>          Key: HUDI-2616
>          URL: https://issues.apache.org/jira/browse/HUDI-2616
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Major
>      Fix For: 0.10.0
[GitHub] [hudi] danny0405 commented on a change in pull request #3599: [HUDI-2207] Support independent flink hudi clustering function
danny0405 commented on a change in pull request #3599:
URL: https://github.com/apache/hudi/pull/3599#discussion_r735249946

   ## File path: hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java

   @@ -528,6 +528,66 @@ private FlinkOptions() {
          .defaultValue(20) // default min 20 commits
          .withDescription("Min number of commits to keep before archiving older commits into a sequential log, default 20");

   +  // ----------------------------------------
   +  //  Clustering Options
   +  // ----------------------------------------
   +
   +  public static final ConfigOption<Boolean> CLUSTERING_SCHEDULE_ENABLED = ConfigOptions
   +      .key("clustering.schedule.enabled")
   +      .booleanType()
   +      .defaultValue(false) // default false for pipeline
   +      .withDescription("Async clustering, default false for pipeline");
   +
   +  public static final ConfigOption<Integer> CLUSTERING_TASKS = ConfigOptions
   +      .key("clustering.tasks")
   +      .intType()
   +      .defaultValue(10)
   +      .withDescription("Parallelism of tasks that do actual clustering, default is 10");

   Review comment: Change the default value to be the same as `compaction.tasks`, which is `4`.

   ## File path: hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java

   @@ -0,0 +1,178 @@
   +/*
   + * Licensed to the Apache Software Foundation (ASF) under one
   + * or more contributor license agreements.  See the NOTICE file
   + * distributed with this work for additional information
   + * regarding copyright ownership.  The ASF licenses this file
   + * to you under the Apache License, Version 2.0 (the
   + * "License"); you may not use this file except in compliance
   + * with the License.  You may obtain a copy of the License at
   + *
   + *      http://www.apache.org/licenses/LICENSE-2.0
   + *
   + * Unless required by applicable law or agreed to in writing, software
   + * distributed under the License is distributed on an "AS IS" BASIS,
   + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   + * See the License for the specific language governing permissions and
   + * limitations under the License.
   + */
   +
   +package org.apache.hudi.sink.cluster;
   +
   +import org.apache.hudi.avro.model.HoodieClusteringPlan;
   +import org.apache.hudi.client.HoodieFlinkWriteClient;
   +import org.apache.hudi.common.model.WriteOperationType;
   +import org.apache.hudi.common.table.HoodieTableMetaClient;
   +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
   +import org.apache.hudi.common.table.timeline.HoodieInstant;
   +import org.apache.hudi.common.table.timeline.HoodieTimeline;
   +import org.apache.hudi.common.util.ClusteringUtils;
   +import org.apache.hudi.common.util.Option;
   +import org.apache.hudi.common.util.collection.Pair;
   +import org.apache.hudi.configuration.FlinkOptions;
   +import org.apache.hudi.sink.clustering.ClusteringCommitEvent;
   +import org.apache.hudi.sink.clustering.ClusteringCommitSink;
   +import org.apache.hudi.sink.clustering.ClusteringFunction;
   +import org.apache.hudi.sink.clustering.ClusteringPlanSourceFunction;
   +import org.apache.hudi.sink.clustering.FlinkClusteringConfig;
   +import org.apache.hudi.table.HoodieFlinkTable;
   +import org.apache.hudi.util.AvroSchemaConverter;
   +import org.apache.hudi.util.CompactionUtil;
   +import org.apache.hudi.util.StreamerUtil;
   +import org.apache.hudi.utils.TestConfigurations;
   +import org.apache.hudi.utils.TestData;
   +
   +import org.apache.avro.Schema;
   +import org.apache.flink.api.common.typeinfo.TypeInformation;
   +import org.apache.flink.configuration.Configuration;
   +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
   +import org.apache.flink.streaming.api.operators.ProcessOperator;
   +import org.apache.flink.table.api.EnvironmentSettings;
   +import org.apache.flink.table.api.TableEnvironment;
   +import org.apache.flink.table.api.config.ExecutionConfigOptions;
   +import org.apache.flink.table.api.internal.TableEnvironmentImpl;
   +import org.apache.flink.table.types.DataType;
   +import org.apache.flink.table.types.logical.RowType;
   +import org.junit.jupiter.api.Test;
   +import org.junit.jupiter.api.io.TempDir;
   +
   +import java.io.File;
   +import java.util.HashMap;
   +import java.util.Map;
   +import java.util.concurrent.TimeUnit;
   +
   +import static org.junit.jupiter.api.Assertions.assertTrue;
   +
   +public class ITTestHoodieFlinkClustering {
   +
   +  private static final Map<String, String> EXPECTED = new HashMap<>();
   +
   +  static {
   +    EXPECTED.put("par1", "[id1,par1,id1,Danny,23,1000,par1, id2,par1,id2,Stephen,33,2000,par1]");
   +    EXPECTED.put("par2", "[id3,par2,id3,Julian,53,3000,par2, id4,par2,id4,Fabian,31,4000,par2]");
   +    EXPECTED.put("par3", "[id5,par3,id5,Sophia,18,5000,par3, id6,par3,id6,Emma,20,6000,par3]");
   +    EXPECTED.put("par4", "[id7,par4,id7,Bob,44,7000,par4, id8,par4,id8,Han,56,8000,par4]");
   +  }
   +
   +  @TempDir
   +  File tempFile;
   +
   +  @Test
   +  public void testHoodieFlinkClustering() throws Exception {
   +    // Create h
[jira] [Created] (HUDI-2617) Implement HBase Index for Dataset
Raymond Xu created HUDI-2617:
--------------------------------

         Summary: Implement HBase Index for Dataset
             Key: HUDI-2617
             URL: https://issues.apache.org/jira/browse/HUDI-2617
         Project: Apache Hudi
      Issue Type: Sub-task
        Reporter: Raymond Xu
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
    Description: End to end upsert operation, with proper functional tests coverage.

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Sub-task
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
>
> End to end upsert operation, with proper functional tests coverage.
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
    Story Points: 3  (was: 2)

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Sub-task
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset
[ https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2616:
-----------------------------
    Story Points: 2

> Implement BloomIndex for Dataset
> --------------------------------
>
>          Key: HUDI-2616
>          URL: https://issues.apache.org/jira/browse/HUDI-2616
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Raymond Xu
>     Priority: Major
[jira] [Created] (HUDI-2616) Implement BloomIndex for Dataset
Raymond Xu created HUDI-2616:
--------------------------------

         Summary: Implement BloomIndex for Dataset
             Key: HUDI-2616
             URL: https://issues.apache.org/jira/browse/HUDI-2616
         Project: Apache Hudi
      Issue Type: Sub-task
        Reporter: Raymond Xu
[jira] [Created] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index
Raymond Xu created HUDI-2615:
--------------------------------

         Summary: Decouple HoodieRecordPayload with Hoodie table, table services, and index
             Key: HUDI-2615
             URL: https://issues.apache.org/jira/browse/HUDI-2615
         Project: Apache Hudi
      Issue Type: Sub-task
        Reporter: Raymond Xu

HoodieTable, HoodieIndex, and compaction, clustering services should be independent of HoodieRecordPayload
[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths
[ https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2531:
-----------------------------
    Priority: Blocker  (was: Critical)

> [UMBRELLA] Support Dataset APIs in writer paths
> -----------------------------------------------
>
>          Key: HUDI-2531
>          URL: https://issues.apache.org/jira/browse/HUDI-2531
>      Project: Apache Hudi
>   Issue Type: New Feature
>   Components: Spark Integration
>     Reporter: Raymond Xu
>     Assignee: Raymond Xu
>     Priority: Blocker
>       Labels: hudi-umbrellas
>
> To make use of Dataset APIs in writer paths instead of RDD.
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
    Story Points: 2

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Sub-task
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
    Status: In Progress  (was: Open)

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Sub-task
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
        Parent: HUDI-2531
    Issue Type: Sub-task  (was: Improvement)

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Sub-task
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex
[ https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1430:
-----------------------------
    Summary: Implement SparkDataFrameWriteClient with SimpleIndex  (was: Support Dataset write w/o conversion to RDD)

> Implement SparkDataFrameWriteClient with SimpleIndex
> ----------------------------------------------------
>
>              Key: HUDI-1430
>              URL: https://issues.apache.org/jira/browse/HUDI-1430
>          Project: Apache Hudi
>       Issue Type: Improvement
>       Components: Writer Core
> Affects Versions: 0.9.0
>         Reporter: sivabalan narayanan
>         Assignee: Raymond Xu
>         Priority: Blocker
>          Fix For: 0.10.0
[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1970:
-----------------------------
    Status: In Progress  (was: Open)

> Performance testing/certification of key SQL DMLs
> -------------------------------------------------
>
>          Key: HUDI-1970
>          URL: https://issues.apache.org/jira/browse/HUDI-1970
>      Project: Apache Hudi
>   Issue Type: Sub-task
>   Components: Performance, Spark Integration
>     Reporter: Vinoth Chandar
>     Assignee: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
[jira] [Commented] (HUDI-1970) Performance testing/certification of key SQL DMLs
[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433587#comment-17433587 ]

Raymond Xu commented on HUDI-1970:
----------------------------------

Setup:
* 1B records (randomized values in the example trip model)
* 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / partition
* EMR 6.2, Spark 3.0.1-amzn-0
* S3, parquet compression: snappy
* hudi: 109.8 GB = 22.4 MB parquet x 5000
* delta: 70.9 GB = 14.5 MB parquet x 5000

Query timings with Hudi 0.9.0 (seconds; three measurements per query):

|SQL|Run 1|Run 2|Run 3|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086|

> Performance testing/certification of key SQL DMLs
> -------------------------------------------------
>
>          Key: HUDI-1970
>          URL: https://issues.apache.org/jira/browse/HUDI-1970
>      Project: Apache Hudi
>   Issue Type: Sub-task
>   Components: Performance, Spark Integration
>     Reporter: Vinoth Chandar
>     Assignee: Raymond Xu
>     Priority: Blocker
>      Fix For: 0.10.0
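The dataset-size figures in the setup above follow from total = average parquet file size x file count. A quick sketch to sanity-check them; the GiB conversion (divide MB by 1024) is an assumption about how the totals were computed:

```python
# Sanity-check the reported totals: avg file size (MB) times file count, in GiB.
def total_gib(avg_file_mb: float, num_files: int) -> float:
    return avg_file_mb * num_files / 1024

hudi_gib = total_gib(22.4, 5000)   # close to the reported 109.8 GB
delta_gib = total_gib(14.5, 5000)  # close to the reported 70.9 GB
print(round(hudi_gib, 1), round(delta_gib, 1))
```

Both results land within half a GiB of the reported sizes, so the per-file averages were presumably rounded.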
[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2287:
-----------------------------
    Priority: Major  (was: Blocker)

> Partition pruning not working on Hudi dataset
> ---------------------------------------------
>
>                Key: HUDI-2287
>                URL: https://issues.apache.org/jira/browse/HUDI-2287
>            Project: Apache Hudi
>         Issue Type: Sub-task
>         Components: Performance
>           Reporter: Rajkumar Gunasekaran
>           Assignee: Raymond Xu
>           Priority: Major
>            Fix For: 0.10.0
>  Original Estimate: 24h
> Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset with a two-level partition scheme like this:
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string.
> When running a simple count query using the Hudi format in spark-shell, it takes almost 3 minutes to complete:
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>   where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>   count()
>
> res1: Long =
> attempt 1: 3.2 minutes
> attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately equal to the total number of files in the ENTIRE dataset at s3://somes3bucket) are used for the computation. It seems Spark is reading the entire dataset instead of *partition pruning*, and then filtering based on the where clause.
> Whereas if I use the parquet format to read the dataset, the query only takes ~30 seconds (vis-a-vis 3 minutes with the Hudi format):
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>   where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>   count()
>
> res2: Long =
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files (i.e. 1361 tasks) are scanned (vis-a-vis ~9000 files with Hudi) and it takes only 15 seconds.
> Any idea why partition pruning is not working when using the Hudi format? Wondering if I am missing any configuration during the creation of the dataset?
> PS: I ran this query on emr-6.3.0, which has Hudi version 0.7.0, and here is the configuration I used for creating the dataset:
> {code:scala}
> df.writeStream
>   .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>   .partitionBy("partition1", "partition2")
>   .format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>   //--
>   .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>   .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, param.expectedFileSizeInBytes)
>   .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>   //--
>   .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, (param.expectedFileSizeInBytes / 100) * 80)
>   .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>   .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, param.runCompactionAfterNDeltaCommits.get)
>   //--
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>   .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, classOf[CustomKeyGenerator].getName)
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition1:SIMPLE,partition2:SIMPLE")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, hudiTablePrecombineKey)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partition1,partition2")
>   .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, param.hiveNHudiTableName.get)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName)
>   .outputMode(OutputMode.Append())
>   .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query")
> {code}
[jira] [Commented] (HUDI-2287) Partition pruning not working on Hudi dataset
[ https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433586#comment-17433586 ]

Raymond Xu commented on HUDI-2287:
----------------------------------

[~rjkumr] This is likely caused by the `hoodie.table.partition.fields` config in your hoodie.properties. Since you're using a CustomKeyGenerator, it's not clear how that affects the partition field settings; in the case of SimpleKeyGenerator, you'd expect `hoodie.table.partition.fields=partition1,partition2`. By manually modifying it to match your CustomKeyGenerator's logic, you should be able to get partition pruning to work.

> Partition pruning not working on Hudi dataset
> ---------------------------------------------
>
>                Key: HUDI-2287
>                URL: https://issues.apache.org/jira/browse/HUDI-2287
>            Project: Apache Hudi
>         Issue Type: Sub-task
>         Components: Performance
>           Reporter: Rajkumar Gunasekaran
>           Assignee: Raymond Xu
>           Priority: Blocker
>            Fix For: 0.10.0
>
>  Original Estimate: 24h
> Remaining Estimate: 24h
[GitHub] [hudi] hudi-bot edited a comment on pull request #3858: [MINOR] Fix README for hudi-kafka-connect
hudi-bot edited a comment on pull request #3858:
URL: https://github.com/apache/hudi/pull/3858#issuecomment-950564845

   ## CI report:

   * f2ed52360c22cba5bbade224be9b3a6cec660d36 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2827)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run travis` re-run the last Travis build
   - `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #3858: [MINOR] Fix README for hudi-kafka-connect
hudi-bot commented on pull request #3858:
URL: https://github.com/apache/hudi/pull/3858#issuecomment-950564845

   ## CI report:

   * f2ed52360c22cba5bbade224be9b3a6cec660d36 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run travis` re-run the last Travis build
   - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink
hudi-bot edited a comment on pull request #3857:
URL: https://github.com/apache/hudi/pull/3857#issuecomment-950560156

   ## CI report:

   * 34cb663a0afb4362af0795384058378ef6ec130a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2825)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run travis` re-run the last Travis build
   - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua opened a new pull request #3858: [MINOR] Fix README for hudi-kafka-connect
yihua opened a new pull request #3858:
URL: https://github.com/apache/hudi/pull/3858

   ## What is the purpose of the pull request

   This PR fixes the tutorial in README.md for hudi-kafka-connect.

   ## Brief change log

   - Edits to the commands so that they are runnable.

   ## Verify this pull request

   Successfully ran the commands in the tutorial to make sure the Kafka Connect Sink for Hudi can be set up locally.

   ## Committer checklist

   - [ ] Has a corresponding JIRA in PR title & commit
   - [ ] Commit message is descriptive of the change
   - [ ] CI is green
   - [ ] Necessary doc changes done or have another open PR
   - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot commented on pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink
hudi-bot commented on pull request #3857:
URL: https://github.com/apache/hudi/pull/3857#issuecomment-950560156

   ## CI report:

   * 34cb663a0afb4362af0795384058378ef6ec130a UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run travis` re-run the last Travis build
   - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-2332) Implement scheduling of compaction/ clustering for Kafka Connect
[ https://issues.apache.org/jira/browse/HUDI-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-2332:
---------------------------------
    Labels: pull-request-available  (was: )

> Implement scheduling of compaction/ clustering for Kafka Connect
> ----------------------------------------------------------------
>
>          Key: HUDI-2332
>          URL: https://issues.apache.org/jira/browse/HUDI-2332
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Rajesh Mahindra
>     Assignee: Ethan Guo
>     Priority: Blocker
>       Labels: pull-request-available
>      Fix For: 0.10.0
>
> * Implement compaction/ clustering etc. from Java client
> * Schedule from Coordinator
[GitHub] [hudi] yihua opened a new pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink
yihua opened a new pull request #3857: URL: https://github.com/apache/hudi/pull/3857 ## What is the purpose of the pull request This PR adds the functionality of clustering and compaction in Kafka Connect Sink for Hudi. ## Brief change log ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved
hudi-bot edited a comment on pull request #3802: URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747 ## CI report: * b63edfaca889ac6444b61a525cc9ee1065f610db Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2824) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2077: - Priority: Critical (was: Major) > Flaky test: TestHoodieDeltaStreamer > --- > > Key: HUDI-2077 > URL: https://issues.apache.org/jira/browse/HUDI-2077 > Project: Apache Hudi > Issue Type: Sub-task > Components: Testing >Reporter: Raymond Xu >Assignee: Sagar Sumit >Priority: Critical > Labels: pull-request-available > Attachments: 28.txt, hudi_2077_schema_mismatch.txt > > > {code:java} > [INFO] Results: > [ERROR] Errors: > [ERROR] TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution{code} > Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for > details. > {quote} > 1730667 [pool-1461-thread-1] WARN org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer - Got error : > org.apache.hudi.exception.HoodieIOException: Could not check if hdfs://localhost:4/user/vsts/continuous_mor_mulitwriter is a valid table > at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59) > at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:112) > at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:73) > at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:606) > at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.assertAtleastNDeltaCommitsAfterCommit(TestHoodieDeltaStreamer.java:322) > at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$8(TestHoodieDeltaStreamer.java:906) > at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.lambda$waitTillCondition$0(TestHoodieDeltaStreamer.java:347) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.ConnectException: Call From fv-az238-328/10.1.0.24 to localhost:4 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: [http://wiki.apache.org/hadoop/ConnectionRefused] > {quote}
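The stack trace above shows a transient HDFS connection error surfacing through the test's polling helper (`waitTillCondition`). As a rough illustration of that poll-until-condition pattern — a hypothetical sketch, not Hudi's actual `TestHelpers` implementation — the helper retries the check and swallows transient exceptions until a deadline:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a poll-until-condition test helper: retry a check
// until it passes or a timeout elapses, tolerating transient failures
// (e.g. a ConnectException wrapped in an IO exception) along the way.
public class WaitUtil {

  public static boolean waitTillCondition(BooleanSupplier condition,
                                          long timeoutMs,
                                          long pollMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      try {
        if (condition.getAsBoolean()) {
          return true; // condition satisfied before the deadline
        }
      } catch (RuntimeException e) {
        // Transient failure: swallow and retry until the deadline expires.
      }
      TimeUnit.MILLISECONDS.sleep(pollMs);
    }
    return false; // timed out without the condition ever holding
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.currentTimeMillis();
    // Condition becomes true after ~100 ms, well within the 2 s timeout.
    boolean ok = waitTillCondition(() -> System.currentTimeMillis() - start > 100, 2000, 20);
    System.out.println(ok); // prints "true"
  }
}
```

A flaky failure like the one quoted happens when the condition never becomes true before the timeout, so the swallowed transient error is only logged as a WARN.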
[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test
[ https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1706: - Priority: Major (was: Blocker) > Test flakiness w/ multiwriter test > -- > > Key: HUDI-1706 > URL: https://issues.apache.org/jira/browse/HUDI-1706 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Major > Fix For: 0.10.0 > > > [https://api.travis-ci.com/v3/job/492130170/log.txt]
[jira] [Created] (HUDI-2614) Remove duplicated hadoop-hdfs with tests classifier exists in bundles
vinoyang created HUDI-2614: -- Summary: Remove duplicated hadoop-hdfs with tests classifier exists in bundles Key: HUDI-2614 URL: https://issues.apache.org/jira/browse/HUDI-2614 Project: Apache Hudi Issue Type: Sub-task Reporter: vinoyang Assignee: vinoyang
[jira] [Updated] (HUDI-2600) Remove duplicated hadoop-common with tests classifier exists in bundles
[ https://issues.apache.org/jira/browse/HUDI-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-2600: --- Fix Version/s: 0.10.0 > Remove duplicated hadoop-common with tests classifier exists in bundles > --- > > Key: HUDI-2600 > URL: https://issues.apache.org/jira/browse/HUDI-2600 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release & Administrative >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > We found many duplicated dependencies in the generated dependency list, > `hadoop-common` is one of them: > {code:java} > hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar > hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar > {code}
[jira] [Closed] (HUDI-2600) Remove duplicated hadoop-common with tests classifier exists in bundles
[ https://issues.apache.org/jira/browse/HUDI-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-2600. -- Resolution: Done 220bf6a7e6f5cdf0efbbbee9df6852a8b2288570 > Remove duplicated hadoop-common with tests classifier exists in bundles > --- > > Key: HUDI-2600 > URL: https://issues.apache.org/jira/browse/HUDI-2600 > Project: Apache Hudi > Issue Type: Sub-task > Components: Release & Administrative >Reporter: vinoyang >Assignee: vinoyang >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > We found many duplicated dependencies in the generated dependency list, > `hadoop-common` is one of them: > {code:java} > hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar > hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar > {code}
[hudi] branch master updated: [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 220bf6a [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847) 220bf6a is described below commit 220bf6a7e6f5cdf0efbbbee9df6852a8b2288570 Author: vinoyang AuthorDate: Mon Oct 25 13:45:28 2021 +0800 [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847) --- dependencies/hudi-flink-bundle_2.11.txt | 6 +++--- dependencies/hudi-hive-sync-bundle.txt | 7 +-- dependencies/hudi-kafka-connect-bundle.txt | 3 +-- dependencies/hudi-spark-bundle_2.11.txt | 3 +-- dependencies/hudi-timeline-server-bundle.txt | 1 - dependencies/hudi-utilities-bundle_2.11.txt | 3 +-- hudi-client/hudi-client-common/pom.xml | 1 + hudi-sync/hudi-hive-sync/pom.xml | 1 + hudi-timeline-service/pom.xml| 1 + 9 files changed, 10 insertions(+), 16 deletions(-) diff --git a/dependencies/hudi-flink-bundle_2.11.txt b/dependencies/hudi-flink-bundle_2.11.txt index b97995c..4414594 100644 --- a/dependencies/hudi-flink-bundle_2.11.txt +++ b/dependencies/hudi-flink-bundle_2.11.txt @@ -64,7 +64,7 @@ commons-lang/commons-lang/2.6//commons-lang-2.6.jar commons-lang3/org.apache.commons/3.1//commons-lang3-3.1.jar commons-logging/commons-logging/1.2//commons-logging-1.2.jar commons-math/org.apache.commons/2.2//commons-math-2.2.jar -commons-math3/org.apache.commons/3.1.1//commons-math3-3.1.1.jar +commons-math3/org.apache.commons/3.5//commons-math3-3.5.jar commons-net/commons-net/3.1//commons-net-3.1.jar commons-pool/commons-pool/1.6//commons-pool-1.6.jar config/com.typesafe/1.3.3//config-1.3.3.jar @@ -107,6 +107,7 @@ force-shading/org.apache.flink/1.13.1//force-shading-1.13.1.jar grizzled-slf4j_2.11/org.clapper/1.3.2//grizzled-slf4j_2.11-1.3.2.jar groovy-all/org.codehaus.groovy/2.4.4//groovy-all-2.4.4.jar 
gson/com.google.code.gson/2.3.1//gson-2.3.1.jar +guava/com.google.guava/12.0.1//guava-12.0.1.jar guice-assistedinject/com.google.inject.extensions/3.0//guice-assistedinject-3.0.jar guice-servlet/com.google.inject.extensions/3.0//guice-servlet-3.0.jar guice/com.google.inject/3.0//guice-3.0.jar @@ -114,7 +115,6 @@ hadoop-annotations/org.apache.hadoop/2.7.3//hadoop-annotations-2.7.3.jar hadoop-auth/org.apache.hadoop/2.7.3//hadoop-auth-2.7.3.jar hadoop-client/org.apache.hadoop/2.7.3//hadoop-client-2.7.3.jar hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar -hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar hadoop-hdfs/org.apache.hadoop/2.7.3//hadoop-hdfs-2.7.3.jar hadoop-hdfs/org.apache.hadoop/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar hadoop-mapreduce-client-app/org.apache.hadoop/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar @@ -132,7 +132,7 @@ hadoop-yarn-server-resourcemanager/org.apache.hadoop/2.7.2//hadoop-yarn-server-r hadoop-yarn-server-web-proxy/org.apache.hadoop/2.7.2//hadoop-yarn-server-web-proxy-2.7.2.jar hamcrest-core/org.hamcrest/1.3//hamcrest-core-1.3.jar hbase-annotations/org.apache.hbase/1.2.3//hbase-annotations-1.2.3.jar -hbase-client/org.apache.hbase/1.1.1//hbase-client-1.1.1.jar +hbase-client/org.apache.hbase/1.2.3//hbase-client-1.2.3.jar hbase-common/org.apache.hbase/1.2.3//hbase-common-1.2.3.jar hbase-common/org.apache.hbase/1.2.3/tests/hbase-common-1.2.3-tests.jar hbase-hadoop-compat/org.apache.hbase/1.2.3//hbase-hadoop-compat-1.2.3.jar diff --git a/dependencies/hudi-hive-sync-bundle.txt b/dependencies/hudi-hive-sync-bundle.txt index aefcfbb..f80ee31 100644 --- a/dependencies/hudi-hive-sync-bundle.txt +++ b/dependencies/hudi-hive-sync-bundle.txt @@ -56,7 +56,6 @@ hadoop-annotations/org.apache.hadoop/2.7.3//hadoop-annotations-2.7.3.jar hadoop-auth/org.apache.hadoop/2.7.3//hadoop-auth-2.7.3.jar hadoop-client/org.apache.hadoop/2.7.3//hadoop-client-2.7.3.jar 
hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar -hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar hadoop-hdfs/org.apache.hadoop/2.7.3//hadoop-hdfs-2.7.3.jar hadoop-hdfs/org.apache.hadoop/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar hadoop-mapreduce-client-app/org.apache.hadoop/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar @@ -87,9 +86,7 @@ jackson-annotations/com.fasterxml.jackson.core/2.6.7//jackson-annotations-2.6.7. jackson-core-asl/org.codehaus.jackson/1.9.13//jackson-core-asl-1.9.13.jar jackson-core/com.fasterxml.jackson.core/2.6.7//jackson-core-2.6.7.jar jackson-databind/com.fasterxml.jackson.core/2.6.7.3//jackson-databind-2.6.7.3.jar -jackson-jaxrs/org.codehaus.jackson/1.9.13//jackson-jaxrs-1.9.13.jar jackson-mapper-asl/org.codehaus.jackson/1.9.13//jackson-mapper-asl-1.9.13.jar -jackson-xc/org.codehaus.jackson/1.9.13//jackson-xc-1.9.13.jar jamon-runtime/org.jamon/2.4.1//jamon-runti
[GitHub] [hudi] yanghua merged pull request #3847: [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles
yanghua merged pull request #3847: URL: https://github.com/apache/hudi/pull/3847
[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table
nsivabalan commented on a change in pull request #3762: URL: https://github.com/apache/hudi/pull/3762#discussion_r735269130 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java ## @@ -200,8 +201,49 @@ protected BaseTableMetadata(HoodieEngineContext engineContext, HoodieMetadataCon return statuses; } + Map fetchAllFilesInPartitionPaths(List partitionPaths) throws IOException { Review comment: All of our tests in TestHoodieBackedMetadata use HoodieBackedTableMetadata for assertions. I already tried writing unit tests and felt it's already covered, hence I did not write one explicitly.
[GitHub] [hudi] nsivabalan commented on pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table
nsivabalan commented on pull request #3762: URL: https://github.com/apache/hudi/pull/3762#issuecomment-950546155 @prashantwason : Can you please review the patch when you get time?
[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table
nsivabalan commented on a change in pull request #3762: URL: https://github.com/apache/hudi/pull/3762#discussion_r735268438 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -120,65 +120,114 @@ private void initIfNeeded() { } @Override - protected Option> getRecordByKeyFromMetadata(String key, String partitionName) { -Pair readers = openReadersIfNeeded(key, partitionName); + protected Option> getRecordByKey(String key, String partitionName) { +return getRecordsByKeys(Collections.singletonList(key), partitionName).get(0).getValue(); + } + + protected List>>> getRecordsByKeys(List keys, String partitionName) { Review comment: 1. I see your point about making inline vs. full scan configurable depending on the different partitions in metadata. I will address this. I guess FILES will do a full scan, while col_stats (min/max stats), bloom_filter, and record index will do inline reads. 2. If we go with (1), I am not sure we need to try out (2) as well.
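The idea in point (1) of the review comment above — choosing a full scan vs. a seek-based (inline) read per metadata-table partition — can be sketched as follows. This is a hypothetical illustration under assumed partition names and an assumed `Strategy` enum, not Hudi's actual API:

```java
import java.util.Locale;

// Hypothetical sketch: pick a metadata lookup strategy per metadata partition.
// The small FILES partition can afford a full scan; large partitions such as
// column stats, bloom filters, or a record-level index want point lookups
// that seek directly to the matching keys in the HFile log blocks.
public class MetadataLookupStrategy {

  enum Strategy { FULL_SCAN, POINT_LOOKUP }

  static Strategy forPartition(String partitionName) {
    switch (partitionName.toLowerCase(Locale.ROOT)) {
      case "files":
        return Strategy.FULL_SCAN;    // small partition: read everything once
      case "col_stats":
      case "bloom_filter":
      case "record_index":
        return Strategy.POINT_LOOKUP; // large partitions: seek to matching keys
      default:
        return Strategy.POINT_LOOKUP; // conservative default for unknown partitions
    }
  }

  public static void main(String[] args) {
    System.out.println(forPartition("files"));        // prints "FULL_SCAN"
    System.out.println(forPartition("bloom_filter")); // prints "POINT_LOOKUP"
  }
}
```

The design point is that the dispatch is driven by the partition, so callers such as a batched `getRecordsByKeys` need not know which read path is used underneath.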
[GitHub] [hudi] nsivabalan commented on pull request #3827: [HUDI-2573] Fixing double locking with multi-writers
nsivabalan commented on pull request #3827: URL: https://github.com/apache/hudi/pull/3827#issuecomment-950539640 @manojpec : thanks for your inputs. I do like the idea of TransactionManager handling the locking depending on whether the lock acquisition is requested by the same owner or a different one. But I see some implementation hurdles in that. Let me see how I can go about it.
[GitHub] [hudi] nsivabalan merged pull request #3757: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader
nsivabalan merged pull request #3757: URL: https://github.com/apache/hudi/pull/3757
[hudi] branch master updated: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 1bb0532 [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757) 1bb0532 is described below commit 1bb05325637740498cac548872cf7223e34950d0 Author: Sivabalan Narayanan AuthorDate: Mon Oct 25 01:21:08 2021 -0400 [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757) --- .../hudi/common/table/log/HoodieLogFileReader.java | 12 ++-- .../apache/hudi/common/table/log/HoodieLogFormat.java | 2 +- .../hudi/common/table/log/HoodieLogFormatReader.java | 4 ++-- .../apache/hudi/common/table/log/LogReaderUtils.java | 18 +++--- .../hudi/metadata/HoodieMetadataFileSystemView.java| 2 +- .../hadoop/realtime/AbstractRealtimeRecordReader.java | 2 +- .../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java | 12 ++-- .../realtime/RealtimeBootstrapBaseFileSplit.java | 13 +++-- .../org/apache/hudi/hadoop/realtime/RealtimeSplit.java | 3 +++ .../hadoop/utils/HoodieRealtimeInputFormatUtils.java | 5 +++-- .../hadoop/realtime/TestHoodieRealtimeFileSplit.java | 5 - .../realtime/TestHoodieRealtimeRecordReader.java | 17 + 12 files changed, 62 insertions(+), 33 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java index f0f3842..88b7e32 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java @@ -74,6 +74,11 @@ public class HoodieLogFileReader implements HoodieLogFormat.Reader { private transient Thread shutdownThread = null; public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, int bufferSize, + boolean readBlockLazily) throws IOException 
{ +this(fs, logFile, readerSchema, bufferSize, readBlockLazily, false); + } + + public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, int bufferSize, boolean readBlockLazily, boolean reverseReader) throws IOException { FSDataInputStream fsDataInputStream = fs.open(logFile.getPath(), bufferSize); this.logFile = logFile; @@ -82,16 +87,11 @@ public class HoodieLogFileReader implements HoodieLogFormat.Reader { this.readBlockLazily = readBlockLazily; this.reverseReader = reverseReader; if (this.reverseReader) { - this.reverseLogFilePosition = this.lastReverseLogFilePosition = fs.getFileStatus(logFile.getPath()).getLen(); + this.reverseLogFilePosition = this.lastReverseLogFilePosition = logFile.getFileSize(); } addShutDownHook(); } - public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, boolean readBlockLazily, - boolean reverseReader) throws IOException { -this(fs, logFile, readerSchema, DEFAULT_BUFFER_SIZE, readBlockLazily, reverseReader); - } - public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema) throws IOException { this(fs, logFile, readerSchema, DEFAULT_BUFFER_SIZE, false, false); } diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java index c566788..569b4a2 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java @@ -274,7 +274,7 @@ public interface HoodieLogFormat { static HoodieLogFormat.Reader newReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema) throws IOException { -return new HoodieLogFileReader(fs, logFile, readerSchema, HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false, false); +return new HoodieLogFileReader(fs, logFile, readerSchema, HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false); } static 
HoodieLogFormat.Reader newReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java index 7267227..e64e1a1 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java @@ -59,7 +59,7 @@ public class HoodieLogFormatReader implements HoodieLogFormat.Reader { this.prevReadersInOpenState = new ArrayList<>(); if (logFiles.size() > 0) { HoodieLogFile n
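The core change in the commit above replaces a direct FileSystem round-trip (`fs.getFileStatus(logFile.getPath()).getLen()`) with the size already carried on the log-file handle (`logFile.getFileSize()`). A minimal sketch of that pattern — hypothetical class names, not Hudi's actual code:

```java
// Hypothetical sketch: carry the already-known file size on the handle so
// readers (e.g. a reverse reader positioning at end-of-file) need no extra
// filesystem metadata call per file.
public class LogFileHandle {
  private final String path;
  private final long fileSize; // captured once, e.g. from a listing's FileStatus

  public LogFileHandle(String path, long fileSize) {
    this.path = path;
    this.fileSize = fileSize;
  }

  public String getPath() {
    return path;
  }

  // A reverse reader can seek to end-of-file without touching the FileSystem.
  public long endOfFilePosition() {
    return fileSize;
  }

  public static void main(String[] args) {
    LogFileHandle handle = new LogFileHandle("/tmp/log.1", 4096L);
    System.out.println(handle.endOfFilePosition()); // prints "4096"
  }
}
```

The benefit is most visible on object stores and remote HDFS, where each avoided metadata call removes a network round-trip per log file.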
[GitHub] [hudi] nsivabalan commented on a change in pull request #3757: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader
nsivabalan commented on a change in pull request #3757: URL: https://github.com/apache/hudi/pull/3757#discussion_r735263839 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeSplit.java ## @@ -41,6 +42,8 @@ */ List getDeltaLogPaths(); Review comment: yes, will take it up as a [follow-up](https://issues.apache.org/jira/browse/HUDI-2613).
[jira] [Assigned] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
[ https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-2613: - Assignee: sivabalan narayanan > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus > > > Key: HUDI-2613 > URL: https://issues.apache.org/jira/browse/HUDI-2613 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of > getDeltalogs()
[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
[ https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2613: -- Parent: HUDI-1292 Issue Type: Sub-task (was: Improvement) > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus > > > Key: HUDI-2613 > URL: https://issues.apache.org/jira/browse/HUDI-2613 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Priority: Major > > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of > getDeltalogs()
[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
[ https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2613: -- Fix Version/s: 0.10.0 > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus > > > Key: HUDI-2613 > URL: https://issues.apache.org/jira/browse/HUDI-2613 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.10.0 > > > Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of > getDeltalogs()
[jira] [Created] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
sivabalan narayanan created HUDI-2613: - Summary: Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus Key: HUDI-2613 URL: https://issues.apache.org/jira/browse/HUDI-2613 Project: Apache Hudi Issue Type: Improvement Reporter: sivabalan narayanan Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of getDeltalogs()
[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved
hudi-bot edited a comment on pull request #3802: URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747 ## CI report: * e906d363c06635bbcc7c69db5fcc4ff0f0f2d919 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2741) * b63edfaca889ac6444b61a525cc9ee1065f610db Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2824) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved
hudi-bot edited a comment on pull request #3802: URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747 ## CI report: * e906d363c06635bbcc7c69db5fcc4ff0f0f2d919 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2741) * b63edfaca889ac6444b61a525cc9ee1065f610db UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
hudi-bot edited a comment on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402 ## CI report: * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2822) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[hudi] branch master updated: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new d856037 [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849) d856037 is described below commit d8560377c306e49b7e58448b6897e9c0e7719f61 Author: Raymond Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Sun Oct 24 21:14:39 2021 -0700 [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849) Remove the logic of using deltastreamer to prep test table. Use fixture (compressed test table) instead. --- .../SparkClientFunctionalTestHarness.java | 8 +- .../hudi/testutils/providers/SparkProvider.java| 4 +- .../apache/hudi/common/testutils/FixtureUtils.java | 81 + .../common/testutils/HoodieTestDataGenerator.java | 5 +- .../functional/TestHoodieDeltaStreamer.java| 12 +-- .../TestHoodieDeltaStreamerWithMultiWriter.java| 96 ++--- .../functional/TestJdbcbasedSchemaProvider.java| 11 ++- .../testutils/sources/AbstractBaseTestSource.java | 26 +- ...inuousModeWithMultipleWriters.COPY_ON_WRITE.zip | Bin 0 -> 2494616 bytes ...inuousModeWithMultipleWriters.MERGE_ON_READ.zip | Bin 0 -> 2910151 bytes 10 files changed, 178 insertions(+), 65 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java index 74ab52d..aca1d83 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java @@ -176,8 +176,14 @@ public class SparkClientFunctionalTestHarness implements SparkProvider, HoodieMe } } + /** + * To clean up Spark resources after all testcases have run in functional tests. 
+ * + * Spark session and contexts were reused for testcases in the same test class. Some + * testcase may invoke this specifically to clean up in case of repeated test runs. + */ @AfterAll - public static synchronized void cleanUpAfterAll() { + public static synchronized void resetSpark() { if (spark != null) { spark.close(); spark = null; diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java index be15dc8..92b1f76 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java @@ -39,6 +39,8 @@ public interface SparkProvider extends org.apache.hudi.testutils.providers.Hoodi SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.app.name", getClass().getName()); sparkConf.set("spark.master", "local[*]"); +sparkConf.set("spark.default.parallelism", "4"); +sparkConf.set("spark.sql.shuffle.partitions", "4"); sparkConf.set("spark.driver.maxResultSize", "2g"); sparkConf.set("spark.hadoop.mapred.output.compress", "true"); sparkConf.set("spark.hadoop.mapred.output.compression.codec", "true"); @@ -52,4 +54,4 @@ public interface SparkProvider extends org.apache.hudi.testutils.providers.Hoodi default SparkConf conf() { return conf(Collections.emptyMap()); } -} \ No newline at end of file +} diff --git a/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java b/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java new file mode 100644 index 000..6dfe0da --- /dev/null +++ b/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.testutils; + +import java.io.File; +import java.io.FileInputStream; +
[GitHub] [hudi] xushiyan merged pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
xushiyan merged pull request #3849: URL: https://github.com/apache/hudi/pull/3849 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
xushiyan commented on pull request #3849: URL: https://github.com/apache/hudi/pull/3849#issuecomment-950511022 Build passed https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2820&view=results
[GitHub] [hudi] Cherry-Puppy removed a comment on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql
Cherry-Puppy removed a comment on issue #3680: URL: https://github.com/apache/hudi/issues/3680#issuecomment-950498398 I also encountered this problem. I still can't find this class after changing the hive version. But there is this class in the jar package.
[GitHub] [hudi] Cherry-Puppy commented on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql
Cherry-Puppy commented on issue #3680: URL: https://github.com/apache/hudi/issues/3680#issuecomment-950503446 @danny0405 I also encountered this problem. I still can't find this class after changing the hive version. But there is this class in the jar package.
[GitHub] [hudi] Cherry-Puppy commented on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql
Cherry-Puppy commented on issue #3680: URL: https://github.com/apache/hudi/issues/3680#issuecomment-950498398 I also encountered this problem. I still can't find this class after changing the hive version. But there is this class in the jar package.
[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
hudi-bot edited a comment on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402 ## CI report: * 822dbe03dc77531858ffd83ebeb91f210f4e7851 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2818) * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2822) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
hudi-bot edited a comment on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402 ## CI report: * 822dbe03dc77531858ffd83ebeb91f210f4e7851 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2818) * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi
hudi-bot edited a comment on pull request #3330: URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571 ## CI report: * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2821) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] boneanxs opened a new issue #3856: [SUPPORT] Maybe should cache baseDir in nonHoodiePathCache in HoodieROTablePathFilter?
boneanxs opened a new issue #3856: URL: https://github.com/apache/hudi/issues/3856 For a non-hoodie table with table path `hdfs://test/warehouse/db/table` and 3 partition columns (p1, p2, p3), the path for a specific partition like (p1=A, p2=B, p3=C) should be `hdfs://test/warehouse/db/table/p1=A/p2=B/p3=C`. HoodieROTablePathFilter will check whether baseDir (hdfs://test/warehouse/db/table) is a valid HoodieTable path or not; otherwise, it caches `hdfs://test/warehouse/db/table/p1=A/p2=B/p3=C` in nonHoodiePathCache. I'm wondering why we don't cache baseDir in nonHoodiePathCache: if we cached baseDir, then for other partitions (like p1=A1, p2=B1, p3=C1) we would only need to check whether baseDir is in nonHoodiePathCache. Please correct me if I'm wrong.
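The caching idea proposed in the issue can be sketched with a small, self-contained prefix cache. This is a hypothetical illustration, not Hudi's actual `HoodieROTablePathFilter` API; names like `markNonHoodie` and `isKnownNonHoodie` are invented for the sketch:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: instead of caching every partition path
// (p1=A/p2=B/p3=C, p1=A1/p2=B1/p3=C1, ...) individually in
// nonHoodiePathCache, cache the table base directory once and answer
// lookups for any partition under it with a prefix check, avoiding
// a filesystem call per sibling partition.
class BaseDirPathFilter {
  private final Set<String> nonHoodieBaseDirCache = new HashSet<>();

  // Record that a base directory was checked and is NOT a Hoodie table.
  void markNonHoodie(String baseDir) {
    nonHoodieBaseDirCache.add(baseDir);
  }

  // A path is known to be non-Hoodie if it equals a cached base dir
  // or lives anywhere underneath one.
  boolean isKnownNonHoodie(String path) {
    for (String baseDir : nonHoodieBaseDirCache) {
      if (path.equals(baseDir) || path.startsWith(baseDir + "/")) {
        return true;
      }
    }
    return false;
  }
}
```

With the base dir cached once, lookups for any other partition of the same table become pure in-memory prefix checks.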
[GitHub] [hudi] dongkelun commented on a change in pull request #3700: [HUDI-2471] Add support ignoring case when column name matches in merge into
dongkelun commented on a change in pull request #3700: URL: https://github.com/apache/hudi/pull/3700#discussion_r735231887 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala ## @@ -163,15 +163,15 @@ case class HoodieResolveReferences(sparkSession: SparkSession) extends Rule[Logi // assignments is empty means insert * or update set * val resolvedSourceOutputWithoutMetaFields = resolvedSource.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name)) val targetOutputWithoutMetaFields = target.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name)) - val resolvedSourceColumnNamesWithoutMetaFields = resolvedSourceOutputWithoutMetaFields.map(_.name) - val targetColumnNamesWithoutMetaFields = targetOutputWithoutMetaFields.map(_.name) + val resolvedSourceColumnNamesWithoutMetaFields = resolvedSourceOutputWithoutMetaFields.map(_.name.toLowerCase) Review comment: @YannByron yes, it can work in column name matching. Do I need to add a test case for upper-case column name definitions? However, case-insensitive matching has not yet been implemented for condition and action; I think we should support it. Should I submit another PR, or support it in this PR?
[jira] [Created] (HUDI-2612) No need to define primary key for flink insert operation
Danny Chen created HUDI-2612: Summary: No need to define primary key for flink insert operation Key: HUDI-2612 URL: https://issues.apache.org/jira/browse/HUDI-2612 Project: Apache Hudi Issue Type: Improvement Components: Flink Integration Reporter: Danny Chen Fix For: 0.10.0 There is one exception: the MOR table may still need the pk to generate {{HoodieKey}} for #preCombine and compaction merge. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] vinothchandar commented on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi
vinothchandar commented on pull request #3330: URL: https://github.com/apache/hudi/pull/3330#issuecomment-950475158 Thanks for your patience. Definitely on it. :)
[GitHub] [hudi] YannByron commented on a change in pull request #3700: [HUDI-2471] Add support ignoring case when column name matches in merge into
YannByron commented on a change in pull request #3700: URL: https://github.com/apache/hudi/pull/3700#discussion_r735220808 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala ## @@ -163,15 +163,15 @@ case class HoodieResolveReferences(sparkSession: SparkSession) extends Rule[Logi // assignments is empty means insert * or update set * val resolvedSourceOutputWithoutMetaFields = resolvedSource.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name)) val targetOutputWithoutMetaFields = target.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name)) - val resolvedSourceColumnNamesWithoutMetaFields = resolvedSourceOutputWithoutMetaFields.map(_.name) - val targetColumnNamesWithoutMetaFields = targetOutputWithoutMetaFields.map(_.name) + val resolvedSourceColumnNamesWithoutMetaFields = resolvedSourceOutputWithoutMetaFields.map(_.name.toLowerCase) Review comment: If the table field is defined in uppercase letters, does that work?
[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi
hudi-bot edited a comment on pull request #3330: URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571 ## CI report: * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN * 8236ece4816e100af13702bf92fdddf9c5e14eaf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2428) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2430) * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2821) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xiarixiaoyao commented on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi
xiarixiaoyao commented on pull request #3330: URL: https://github.com/apache/hudi/pull/3330#issuecomment-950471560 @vinothchandar already rebased the code. Could you help me review it? Thanks.
[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi
hudi-bot edited a comment on pull request #3330: URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571 ## CI report: * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN * 8236ece4816e100af13702bf92fdddf9c5e14eaf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2428) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2430) * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
hudi-bot edited a comment on pull request #3849: URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934 ## CI report: * f623c7545b41a70eb607d530428536567e70fb7a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2820) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xushiyan commented on issue #3854: [SUPPORT] Lower performance using 0.9.0 vs 0.8.0
xushiyan commented on issue #3854: URL: https://github.com/apache/hudi/issues/3854#issuecomment-950455798 @Limess thanks for providing benchmarks! > bulk inserts are slightly faster with Hudi 0.9.0 This is most likely due to the row writer being enabled by default in 0.9.0 https://hudi.apache.org/docs/configurations#hoodiedatasourcewriterowwriterenable which boosts bulk insert in 0.9. @nsivabalan do you have any hints on what changes in 0.9 might cause the slower writes?
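For context on the comparison above, the row-writer path is toggled by a single datasource write option (the key comes from the linked config page). A minimal sketch, assuming an existing DataFrame `df` and an illustrative table name and target path:

```scala
// Sketch: benchmarking bulk_insert with and without the row writer.
// "benchmark_table" and the save path are placeholders, not from the thread.
df.write.format("hudi")
  .option("hoodie.table.name", "benchmark_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  // Default flipped to true in 0.9.0; set to "false" to reproduce the
  // 0.8.x RDD-based write path for an apples-to-apples comparison.
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .mode("append")
  .save("/tmp/hudi/benchmark_table")
```

Running the same job twice with the flag flipped isolates how much of the 0.8.0 vs 0.9.0 difference is attributable to the row writer.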
[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
hudi-bot edited a comment on pull request #3849: URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934 ## CI report: * 16b061c5fa2b1d77755913cb6bda1025c4baf526 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2815) * f623c7545b41a70eb607d530428536567e70fb7a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2820) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
hudi-bot edited a comment on pull request #3849: URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934 ## CI report: * 16b061c5fa2b1d77755913cb6bda1025c4baf526 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2815) * f623c7545b41a70eb607d530428536567e70fb7a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xushiyan closed issue #3845: [SUPPORT]`if not exists` doesn't work on create table in spark-sql
xushiyan closed issue #3845: URL: https://github.com/apache/hudi/issues/3845
[GitHub] [hudi] xushiyan commented on issue #3845: [SUPPORT]`if not exists` doesn't work on create table in spark-sql
xushiyan commented on issue #3845: URL: https://github.com/apache/hudi/issues/3845#issuecomment-950442738 @mutoulbj @BenjMaq Thanks for raising this! It does make sense to print a message indicating the table exists instead of erroring. Filing a JIRA; please feel free to take it if you're interested! https://issues.apache.org/jira/browse/HUDI-2611
[jira] [Created] (HUDI-2611) `create table if not exists` should print message instead of throwing error
Raymond Xu created HUDI-2611: Summary: `create table if not exists` should print message instead of throwing error Key: HUDI-2611 URL: https://issues.apache.org/jira/browse/HUDI-2611 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Raymond Xu See details in https://github.com/apache/hudi/issues/3845#issue-1033218877
[GitHub] [hudi] xushiyan closed issue #3662: [SUPPORT] Error on the spark version in the desc information of the hudi CTAS Table
xushiyan closed issue #3662: URL: https://github.com/apache/hudi/issues/3662
[GitHub] [hudi] xushiyan commented on issue #3662: [SUPPORT] Error on the spark version in the desc information of the hudi CTAS Table
xushiyan commented on issue #3662: URL: https://github.com/apache/hudi/issues/3662#issuecomment-950438362 @kelvin-qin thanks for reproducing this! I see it's not the right Spark version info when doing CTAS from a hudi table; the version info is not propagated correctly. I can also reproduce it; it'd be a nice fix. Filing a JIRA now. If you're keen, please feel free to take it. https://issues.apache.org/jira/browse/HUDI-2610
[jira] [Created] (HUDI-2610) Fix Spark version info for hudi table CTAS from another hudi table
Raymond Xu created HUDI-2610: Summary: Fix Spark version info for hudi table CTAS from another hudi table Key: HUDI-2610 URL: https://issues.apache.org/jira/browse/HUDI-2610 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Raymond Xu See details in the original issue https://github.com/apache/hudi/issues/3662#issuecomment-938489457
[GitHub] [hudi] nsivabalan commented on a change in pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter
nsivabalan commented on a change in pull request #3849: URL: https://github.com/apache/hudi/pull/3849#discussion_r735196225 ## File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamerWithMultiWriter.java ## @@ -254,6 +254,16 @@ private static TypedProperties prepareMultiWriterProps(FileSystem fs, String bas return cfg; } + /** + * Specifically used for {@link TestHoodieDeltaStreamerWithMultiWriter}. + * + * The fixture test tables have random records generated by + * {@link org.apache.hudi.common.testutils.HoodieTestDataGenerator} using + * {@link org.apache.hudi.common.testutils.HoodieTestDataGenerator#TRIP_EXAMPLE_SCHEMA}. + * + * The COW fixture test table has 3000 unique records in 7 compaction commits. Review comment: COW can't have any compaction commits. Did you mean just regular commits?
[GitHub] [hudi] xushiyan closed issue #3392: [SUPPORT] Compile hudi master with hive version 2.1.1 error
xushiyan closed issue #3392: URL: https://github.com/apache/hudi/issues/3392
[GitHub] [hudi] xushiyan commented on issue #3392: [SUPPORT] Compile hudi master with hive version 2.1.1 error
xushiyan commented on issue #3392: URL: https://github.com/apache/hudi/issues/3392#issuecomment-950421639 Closing due to inactivity.
[GitHub] [hudi] xushiyan commented on issue #3760: [SUPPORT] Pushing hoodie metrics to prometheus having error
xushiyan commented on issue #3760: URL: https://github.com/apache/hudi/issues/3760#issuecomment-950421068 > I think spark never try to write to prometheus, even if I put a wrong address, no error. @rubenssoto can you share your settings? @liujinhui1994 could you give any suggestions or hints on the prometheus problems above?
[GitHub] [hudi] xushiyan commented on issue #3760: [SUPPORT] Pushing hoodie metrics to prometheus having error
xushiyan commented on issue #3760: URL: https://github.com/apache/hudi/issues/3760#issuecomment-950420645 @data-storyteller @rubenssoto can you check out this guide prepared by @nsivabalan (to be merged to the website) and see if the instructions help? https://github.com/apache/hudi/commit/959bd6eef8c90c11616840f975ef40a46222a913?short_path=aff66ea#diff-aff66ea1c34953a024c85c6e2fe86b8521b6cd3d623377a96d8d79c6caa8de13 @data-storyteller ``` Exception in thread "main" java.lang.NoSuchMethodError: 'void io.prometheus.client.dropwizard.DropwizardExports.(org.apache.hudi.com.codahale.metrics.MetricRegistry)' ``` Looks like it's a jar issue. Are you using the hudi bundle jar? Can you print your classpath too?
[GitHub] [hudi] xushiyan closed issue #3676: MOR table rolls out new parquet files at 10MB for new inserts - even though max file size set as 128MB
xushiyan closed issue #3676: URL: https://github.com/apache/hudi/issues/3676
[GitHub] [hudi] xushiyan commented on issue #3676: MOR table rolls out new parquet files at 10MB for new inserts - even though max file size set as 128MB
xushiyan commented on issue #3676: URL: https://github.com/apache/hudi/issues/3676#issuecomment-950417563 @nsivabalan I also filed https://issues.apache.org/jira/browse/HUDI-2609 to make the docs clearer on this.
[jira] [Updated] (HUDI-2609) Clarify small file configs in config page
[ https://issues.apache.org/jira/browse/HUDI-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2609: - Labels: user-support-issues (was: ) > Clarify small file configs in config page > - > > Key: HUDI-2609 > URL: https://issues.apache.org/jira/browse/HUDI-2609 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Raymond Xu >Priority: Minor > Labels: user-support-issues > > The knowledge should be preserved in docs close to the related config keys > https://github.com/apache/hudi/issues/3676#issuecomment-922508543
[jira] [Created] (HUDI-2609) Clarify small file configs in config page
Raymond Xu created HUDI-2609: Summary: Clarify small file configs in config page Key: HUDI-2609 URL: https://issues.apache.org/jira/browse/HUDI-2609 Project: Apache Hudi Issue Type: Sub-task Components: Docs Reporter: Raymond Xu The knowledge should be preserved in docs close to the related config keys https://github.com/apache/hudi/issues/3676#issuecomment-922508543
[jira] [Assigned] (HUDI-2607) Reorganize Hudi docs
[ https://issues.apache.org/jira/browse/HUDI-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Mahindra reassigned HUDI-2607: - Assignee: Kyle Weller > Reorganize Hudi docs > > > Key: HUDI-2607 > URL: https://issues.apache.org/jira/browse/HUDI-2607 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Kyle Weller >Assignee: Kyle Weller >Priority: Minor > Labels: pull-request-available > > Reorganize Hudi docs so they are more accessible and easier to find what you > need.
[GitHub] [hudi] xushiyan commented on issue #3191: [SUPPORT]client spark-submit cmd error:Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.DataSourceUtils$.PARTITIONI
xushiyan commented on issue #3191: URL: https://github.com/apache/hudi/issues/3191#issuecomment-950416601 @xer001 `PARTITIONING_COLUMNS_KEY` is **not** present in Spark 2.4.0; see https://jar-download.com/artifacts/org.apache.spark/spark-sql_2.11/2.4.0/source-code/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala It was added in 2.4.2 and later. Can you please upgrade your Spark to a newer `2.4.x` version? @mdz-doit `PARTITIONING_COLUMNS_KEY` **is** there in 2.4.5; see https://jar-download.com/artifacts/org.apache.spark/spark-sql_2.11/2.4.5/source-code/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala Can you print your Spark version from the spark shell to make sure you have the right one? Or do you have other dependencies on your classpath? You can print the whole classpath to inspect.
[GitHub] [hudi] xushiyan edited a comment on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource
xushiyan edited a comment on issue #3835: URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484 @shivabodepudi I see. The problem is that you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` defines only an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro by overriding `org.apache.hudi.schema.SchemaRegistryProvider#fetchSchemaFromRegistry` Meanwhile, I do think supporting JSON schemas makes sense as we support JsonSource anyway. Filing a JIRA for this. https://issues.apache.org/jira/browse/HUDI-2608 @shivabodepudi if you're interested, feel free to pick up this feature.
[GitHub] [hudi] xushiyan edited a comment on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource
xushiyan edited a comment on issue #3835: URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484 @shivabodepudi I see. The problem is that only Avro schema is supported, and you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` only allows an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro by overriding `org.apache.hudi.schema.SchemaRegistryProvider#fetchSchemaFromRegistry`. Meanwhile, I do think supporting JSON schema makes sense, as we support JsonSource anyway. Filing a JIRA for this: https://issues.apache.org/jira/browse/HUDI-2608 @shivabodepudi if you're interested, feel free to pick up this feature.
[GitHub] [hudi] xushiyan commented on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource
xushiyan commented on issue #3835: URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484 @shivabodepudi I see. The problem is that only Avro schema is supported, and you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` only allows an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro. Meanwhile, I do think supporting JSON schema makes sense, as we support JsonSource anyway. Filing a JIRA for this: https://issues.apache.org/jira/browse/HUDI-2608 @shivabodepudi if you're interested, feel free to pick up this feature.
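The conversion hinted at above can be illustrated without Hudi on the classpath. The sketch below is my own illustration, not `SchemaRegistryProvider`'s actual code: it maps JSON-schema primitive type names to Avro type names, which is the core of what an overridden `fetchSchemaFromRegistry` would do before emitting an Avro schema string. The specific mapping choices (e.g. `integer` → `long`, `number` → `double`) are assumptions, not Hudi behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: map JSON-schema primitive type names to Avro type names.
// A real override of SchemaRegistryProvider#fetchSchemaFromRegistry would
// fetch the JSON schema from the registry, walk its "properties", and emit
// a full Avro record schema; only the per-type mapping is shown here.
public class JsonToAvroTypes {
    private static final Map<String, String> TYPE_MAP = new LinkedHashMap<>();
    static {
        TYPE_MAP.put("string", "string");
        TYPE_MAP.put("integer", "long");    // assumption: widest Avro integral type
        TYPE_MAP.put("number", "double");   // assumption: widest Avro floating type
        TYPE_MAP.put("boolean", "boolean");
        TYPE_MAP.put("array", "array");     // element type resolved recursively
        TYPE_MAP.put("object", "record");   // fields resolved recursively
        TYPE_MAP.put("null", "null");
    }

    public static String avroType(String jsonSchemaType) {
        String avro = TYPE_MAP.get(jsonSchemaType);
        if (avro == null) {
            throw new IllegalArgumentException("Unsupported JSON-schema type: " + jsonSchemaType);
        }
        return avro;
    }

    public static void main(String[] args) {
        System.out.println(avroType("integer")); // long
        System.out.println(avroType("object"));  // record
    }
}
```

A subclass would combine this mapping with the registry fetch it inherits, then return the assembled Avro schema string so the rest of the DeltaStreamer pipeline sees only Avro.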
[GitHub] [hudi] xushiyan closed issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource
xushiyan closed issue #3835: URL: https://github.com/apache/hudi/issues/3835
[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider
[ https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2608: - Description: To work with JSON kafka source. Original issue https://github.com/apache/hudi/issues/3835 > Support JSON schema in schema registry provider > --- > > Key: HUDI-2608 > URL: https://issues.apache.org/jira/browse/HUDI-2608 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Raymond Xu >Priority: Major > > To work with JSON kafka source. > > Original issue > https://github.com/apache/hudi/issues/3835 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider
[ https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2608: - Labels: sev:normal user-support-issues (was: ) > Support JSON schema in schema registry provider > --- > > Key: HUDI-2608 > URL: https://issues.apache.org/jira/browse/HUDI-2608 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Raymond Xu >Priority: Major > Labels: sev:normal, user-support-issues > > To work with JSON kafka source. > > Original issue > https://github.com/apache/hudi/issues/3835
[jira] [Created] (HUDI-2608) Support JSON schema in schema registry provider
Raymond Xu created HUDI-2608: Summary: Support JSON schema in schema registry provider Key: HUDI-2608 URL: https://issues.apache.org/jira/browse/HUDI-2608 Project: Apache Hudi Issue Type: New Feature Components: DeltaStreamer Reporter: Raymond Xu
[hudi] branch asf-site updated: [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 4814dff [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851) 4814dff is described below commit 4814dff7dfc1812ba85077fc3ac1910721a81662 Author: laurieliyang <11391675+laurieliy...@users.noreply.github.com> AuthorDate: Mon Oct 25 06:20:56 2021 +0800 [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851) * Update cn doc azure_hoodie.md of current and 0.8.0 * Remove version matter of azure_hoodie of current --- .../current/azure_hoodie.md| 35 ++-- .../current/docker_demo.md | 215 + .../version-0.8.0/azure_hoodie.md | 35 ++-- 3 files changed, 127 insertions(+), 158 deletions(-) diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md index cbda98a..f7ccb84 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md @@ -1,41 +1,42 @@ --- -title: Azure Filesystem +title: Azure 文件系统 keywords: [ hudi, hive, azure, spark, presto] -summary: In this page, we go over how to configure Hudi with Azure filesystem. +summary: 在本页中,我们讨论如何在 Azure 文件系统中配置 Hudi 。 last_modified_at: 2020-05-25T19:00:57-04:00 language: cn --- -In this page, we explain how to use Hudi on Microsoft Azure. +在本页中,我们解释如何在 Microsoft Azure 上使用 Hudi 。 -## Disclaimer +## 声明 -This page is maintained by the Hudi community. -If the information is inaccurate or you have additional information to add. -Please feel free to create a JIRA ticket. Contribution is highly appreciated. +本页面由 Hudi 社区维护。 +如果信息不准确,或者你有信息要补充,请尽管创建 JIRA ticket。 +对此贡献高度赞赏。 -## Supported Storage System +## 支持的存储系统 -There are two storage systems support Hudi . 
+Hudi 支持两种存储系统。 -- Azure Blob Storage +- Azure Blob 存储 - Azure Data Lake Gen 2 -## Verified Combination of Spark and storage system +## 经过验证的 Spark 与存储系统的组合 - HDInsight Spark2.4 on Azure Data Lake Storage Gen 2 + Azure Data Lake Storage Gen 2 上的 HDInsight Spark 2.4 This combination works out of the box. No extra config needed. +这种组合开箱即用,不需要额外的配置。 - Databricks Spark2.4 on Azure Data Lake Storage Gen 2 -- Import Hudi jar to databricks workspace + Azure Data Lake Storage Gen 2 上的 Databricks Spark 2.4 +- 将 Hudi jar 包导入到 databricks 工作区 。 -- Mount the file system to dbutils. +- 将文件系统挂载到 dbutils 。 ```scala dbutils.fs.mount( source = "abfss://x...@xxx.dfs.core.windows.net", mountPoint = "/mountpoint", extraConfigs = configs) ``` -- When writing Hudi dataset, use abfss URL +- 当写入 Hudi 数据集时,使用 abfss URL ```scala inputDF.write .format("org.apache.hudi") @@ -43,7 +44,7 @@ This combination works out of the box. No extra config needed. .mode(SaveMode.Append) .save("abfss://<>.dfs.core.windows.net/hudi-tables/customer") ``` -- When reading Hudi dataset, use the mounting point +- 当读取 Hudi 数据集时,使用挂载点 ```scala spark.read .format("org.apache.hudi") diff --git a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md index 3b8d1f0..eea0e88 100644 --- a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md +++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md @@ -6,18 +6,17 @@ last_modified_at: 2019-12-30T15:59:57-04:00 language: cn --- -## A Demo using docker containers +## 一个使用 Docker 容器的 Demo -Lets use a real world example to see how hudi works end to end. For this purpose, a self contained -data infrastructure is brought up in a local docker cluster within your computer. 
+我们来使用一个真实世界的案例,来看看 Hudi 是如何闭环运转的。 为了这个目的,在你的计算机中的本地 Docker 集群中组建了一个自包含的数据基础设施。 -The steps have been tested on a Mac laptop +以下步骤已经在一台 Mac 笔记本电脑上测试过了。 -### Prerequisites +### 前提条件 - * Docker Setup : For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues. - * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat - * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts + * Docker 安装 : 对于 Mac ,请依照 [https://docs.docker.com/v17.12/docker-for-mac/install/] 当中定义的步骤。 为了运行 Spark-SQL 查询,请确保至少分配给 Docker 6 GB 和 4 个 CPU 。(参见 Docker -> Preferences -> Advanced)。否则,Spark-SQL 查询可能