[jira] [Updated] (HUDI-2619) Make table services work with Dataset

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2619:
-
Description: Clustering, Compaction, and Cleaning should also work with Dataset

> Make table services work with Dataset
> --
>
> Key: HUDI-2619
> URL: https://issues.apache.org/jira/browse/HUDI-2619
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Clustering, Compaction, and Cleaning should also work with Dataset
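
For illustration, a minimal sketch of what a Dataset-native table-services flow could look like (all names below, in particular SparkDataFrameWriteClient, are hypothetical placeholders rather than the actual Hudi API):

{code:java}
import org.apache.hudi.client.common.HoodieSparkEngineContext;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class TableServicesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("table-services").getOrCreate();
    HoodieSparkEngineContext context =
        new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext()));
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("s3://bucket/trips") // placeholder path
        .forTable("trips")
        .build();
    // Hypothetical Dataset-native client mirroring SparkRDDWriteClient:
    SparkDataFrameWriteClient client = new SparkDataFrameWriteClient(context, config);
    // The three table services should run without converting back to RDDs:
    client.scheduleClustering(Option.empty()).ifPresent(i -> client.cluster(i, true));
    client.scheduleCompaction(Option.empty()).ifPresent(i -> client.compact(i));
    client.clean();
  }
}
{code}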



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2619) Make table services work with Dataset

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2619:


 Summary: Make table services work with Dataset
 Key: HUDI-2619
 URL: https://issues.apache.org/jira/browse/HUDI-2619
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2618:
-
Story Points: 3  (was: 4)

> Implement operations other than upsert in SparkDataFrameWriteClient
> ---
>
> Key: HUDI-2618
> URL: https://issues.apache.org/jira/browse/HUDI-2618
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2618:


 Summary: Implement operations other than upsert in 
SparkDataFrameWriteClient
 Key: HUDI-2618
 URL: https://issues.apache.org/jira/browse/HUDI-2618
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2618) Implement operations other than upsert in SparkDataFrameWriteClient

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2618:
-
Story Points: 4

> Implement operations other than upsert in SparkDataFrameWriteClient
> ---
>
> Key: HUDI-2618
> URL: https://issues.apache.org/jira/browse/HUDI-2618
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2617) Implement HBase Index for Dataset

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2617:
-
Fix Version/s: 0.10.0

> Implement HBase Index for Dataset
> --
>
> Key: HUDI-2617
> URL: https://issues.apache.org/jira/browse/HUDI-2617
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2615:
-
Fix Version/s: 0.10.0

> Decouple HoodieRecordPayload with Hoodie table, table services, and index
> -
>
> Key: HUDI-2615
> URL: https://issues.apache.org/jira/browse/HUDI-2615
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> HoodieTable, HoodieIndex, and the compaction and clustering services should be
> independent of HoodieRecordPayload
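
For illustration, one way the decoupling could be expressed (a minimal sketch with hypothetical names, not the actual Hudi classes):

{code:java}
// Sketch: table services and index depend on a narrow record abstraction,
// leaving HoodieRecordPayload as an implementation detail of the write path.
public interface HoodieRecordView {
  String getRecordKey();
  String getPartitionPath();
}

public abstract class EngineHoodieTable<R extends HoodieRecordView> {
  // compaction/clustering planning only needs keys and locations, not payloads:
  public abstract void scheduleCompaction(Iterable<R> records);
  public abstract void scheduleClustering(Iterable<R> records);
}

public abstract class EngineHoodieIndex<R extends HoodieRecordView> {
  public abstract Iterable<R> tagLocation(Iterable<R> records);
}
{code}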



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Fix Version/s: 0.10.0

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> To make use of the Dataset APIs in writer paths instead of RDDs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2616:
-
Fix Version/s: 0.10.0

> Implement BloomIndex for Dataset
> -
>
> Key: HUDI-2616
> URL: https://issues.apache.org/jira/browse/HUDI-2616
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on a change in pull request #3599: [HUDI-2207] Support independent flink hudi clustering function

2021-10-24 Thread GitBox


danny0405 commented on a change in pull request #3599:
URL: https://github.com/apache/hudi/pull/3599#discussion_r735249946



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
##
@@ -528,6 +528,66 @@ private FlinkOptions() {
   .defaultValue(20)// default min 20 commits
   .withDescription("Min number of commits to keep before archiving older 
commits into a sequential log, default 20");
 
+  // -------------------------------------------------------------------------
+  //  Clustering Options
+  // -------------------------------------------------------------------------
+
+  public static final ConfigOption<Boolean> CLUSTERING_SCHEDULE_ENABLED = ConfigOptions
+  .key("clustering.schedule.enabled")
+  .booleanType()
+  .defaultValue(false) // default false for pipeline
+  .withDescription("Async clustering, default false for pipeline");
+
+  public static final ConfigOption<Integer> CLUSTERING_TASKS = ConfigOptions
+  .key("clustering.tasks")
+  .intType()
+  .defaultValue(10)
+  .withDescription("Parallelism of tasks that do actual clustering, default is 10");

Review comment:
   Change the default value to match `compaction.tasks`, which is `4`.
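
   For concreteness, with the suggested default the option would read as follows (a sketch of the requested change, not the merged code):

```java
  public static final ConfigOption<Integer> CLUSTERING_TASKS = ConfigOptions
      .key("clustering.tasks")
      .intType()
      .defaultValue(4) // align with compaction.tasks, whose default is 4
      .withDescription("Parallelism of tasks that do actual clustering, default is 4");
```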

##
File path: 
hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.cluster;
+
+import org.apache.hudi.avro.model.HoodieClusteringPlan;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.sink.clustering.ClusteringCommitEvent;
+import org.apache.hudi.sink.clustering.ClusteringCommitSink;
+import org.apache.hudi.sink.clustering.ClusteringFunction;
+import org.apache.hudi.sink.clustering.ClusteringPlanSourceFunction;
+import org.apache.hudi.sink.clustering.FlinkClusteringConfig;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.CompactionUtil;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.TestData;
+
+import org.apache.avro.Schema;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.api.operators.ProcessOperator;
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.TableEnvironment;
+import org.apache.flink.table.api.config.ExecutionConfigOptions;
+import org.apache.flink.table.api.internal.TableEnvironmentImpl;
+import org.apache.flink.table.types.DataType;
+import org.apache.flink.table.types.logical.RowType;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class ITTestHoodieFlinkClustering {
+
+  private static final Map<String, String> EXPECTED = new HashMap<>();
+
+  static {
+    EXPECTED.put("par1", "[id1,par1,id1,Danny,23,1000,par1, id2,par1,id2,Stephen,33,2000,par1]");
+    EXPECTED.put("par2", "[id3,par2,id3,Julian,53,3000,par2, id4,par2,id4,Fabian,31,4000,par2]");
+    EXPECTED.put("par3", "[id5,par3,id5,Sophia,18,5000,par3, id6,par3,id6,Emma,20,6000,par3]");
+    EXPECTED.put("par4", "[id7,par4,id7,Bob,44,7000,par4, id8,par4,id8,Han,56,8000,par4]");
+  }
+
+  @TempDir
+  File tempFile;
+
+  @Test
+  public void testHoodieFlinkClustering() throws Exception {
+// Create h

[jira] [Created] (HUDI-2617) Implement HBase Index for Dataset

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2617:


 Summary: Implement HBase Index for Dataset
 Key: HUDI-2617
 URL: https://issues.apache.org/jira/browse/HUDI-2617
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Description: End-to-end upsert operation, with proper functional test coverage.

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> End-to-end upsert operation, with proper functional test coverage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Story Points: 3  (was: 2)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2616) Implement BloomIndex for Dataset

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2616:
-
Story Points: 2

> Implement BloomIndex for Dataset
> -
>
> Key: HUDI-2616
> URL: https://issues.apache.org/jira/browse/HUDI-2616
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Raymond Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2616) Implement BloomIndex for Dataset

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2616:


 Summary: Implement BloomIndex for Dataset
 Key: HUDI-2616
 URL: https://issues.apache.org/jira/browse/HUDI-2616
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2615) Decouple HoodieRecordPayload with Hoodie table, table services, and index

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2615:


 Summary: Decouple HoodieRecordPayload with Hoodie table, table 
services, and index
 Key: HUDI-2615
 URL: https://issues.apache.org/jira/browse/HUDI-2615
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Raymond Xu


HoodieTable, HoodieIndex, and the compaction and clustering services should be
independent of HoodieRecordPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2531:
-
Priority: Blocker  (was: Critical)

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: hudi-umbrellas
>
> To make use of the Dataset APIs in writer paths instead of RDDs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Story Points: 2

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Status: In Progress  (was: Open)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Parent: HUDI-2531
Issue Type: Sub-task  (was: Improvement)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1430) Implement SparkDataFrameWriteClient with SimpleIndex

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1430:
-
Summary: Implement SparkDataFrameWriteClient with SimpleIndex  (was: 
Support Dataset write w/o conversion to RDD)

> Implement SparkDataFrameWriteClient with SimpleIndex
> 
>
> Key: HUDI-1430
> URL: https://issues.apache.org/jira/browse/HUDI-1430
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1970:
-
Status: In Progress  (was: Open)

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-10-24 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433587#comment-17433587
 ] 

Raymond Xu commented on HUDI-1970:
--

* 1B records (randomized values in the example trip model)
 * 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / partition
 * EMR 6.2, Spark 3.0.1-amzn-0
 * S3, parquet compression snappy
 * hudi: 109.8 GB = 22.4 MB parquet x 5000
 * delta: 70.9 GB = 14.5 MB parquet x 5000

Timings are in seconds; each query was run three times.
|SQL|Hudi 0.9.0 run 1|run 2|run 3|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086|

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2287:
-
Priority: Major  (was: Blocker)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately the total number of files in the 
> ENTIRE dataset s3://somes3bucket) are used for the computation. It seems Spark 
> is reading the entire dataset instead of doing *partition pruning*, and then 
> filtering it based on the where clause.
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files are scanned (i.e. 1361 tasks, vis-a-vis ~9000 
> files with Hudi) and the scan takes only 15 seconds.
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-24 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433586#comment-17433586
 ] 

Raymond Xu commented on HUDI-2287:
--

[~rjkumr] this is likely caused by the `hoodie.table.partition.fields` config in 
your hoodie.properties. Since you're using a CustomKeyGenerator, I'm not sure how 
that affects the partition field settings; with a SimpleKeyGenerator you'd 
expect `hoodie.table.partition.fields=partition1,partition2`. If you manually 
modify it so that it matches your CustomKeyGenerator's logic, you should be able 
to get partition pruning to work.
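
Concretely, the check/fix would look like this (illustrative; the value must match the partition paths your CustomKeyGenerator actually writes):

{code}
# <table base path>/.hoodie/hoodie.properties
hoodie.table.partition.fields=partition1,partition2
{code}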

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, we have created a Hudi dataset which has a two-level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (approximately the total number of files in the 
> ENTIRE dataset s3://somes3bucket) are used for the computation. It seems Spark 
> is reading the entire dataset instead of doing *partition pruning*, and then 
> filtering it based on the where clause.
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files are scanned (i.e. 1361 tasks, vis-a-vis ~9000 
> files with Hudi) and the scan takes only 15 seconds.
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3858: [MINOR] Fix README for hudi-kafka-connect

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3858:
URL: https://github.com/apache/hudi/pull/3858#issuecomment-950564845


   
   ## CI report:
   
   * f2ed52360c22cba5bbade224be9b3a6cec660d36 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2827)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #3858: [MINOR] Fix README for hudi-kafka-connect

2021-10-24 Thread GitBox


hudi-bot commented on pull request #3858:
URL: https://github.com/apache/hudi/pull/3858#issuecomment-950564845


   
   ## CI report:
   
   * f2ed52360c22cba5bbade224be9b3a6cec660d36 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3857:
URL: https://github.com/apache/hudi/pull/3857#issuecomment-950560156


   
   ## CI report:
   
   * 34cb663a0afb4362af0795384058378ef6ec130a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2825)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua opened a new pull request #3858: [MINOR] Fix README for hudi-kafka-connect

2021-10-24 Thread GitBox


yihua opened a new pull request #3858:
URL: https://github.com/apache/hudi/pull/3858


   ## What is the purpose of the pull request
   
   This PR fixes the tutorial in README.md for hudi-kafka-connect.
   
   ## Brief change log
   
   - Edits to the commands so that they are runnable.
   
   ## Verify this pull request
   
   Successfully ran the commands in the tutorial to make sure the Kafka Connect 
Sink for Hudi can be set up locally.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink

2021-10-24 Thread GitBox


hudi-bot commented on pull request #3857:
URL: https://github.com/apache/hudi/pull/3857#issuecomment-950560156


   
   ## CI report:
   
   * 34cb663a0afb4362af0795384058378ef6ec130a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2332) Implement scheduling of compaction/ clustering for Kafka Connect

2021-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2332:
-
Labels: pull-request-available  (was: )

> Implement scheduling of compaction/ clustering for Kafka Connect
> 
>
> Key: HUDI-2332
> URL: https://issues.apache.org/jira/browse/HUDI-2332
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Rajesh Mahindra
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> * Implement compaction/ clustering etc. from Java client
>  * Schedule from Coordinator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yihua opened a new pull request #3857: [WIP][HUDI-2332] Add clustering and compaction in Kafka Connect Sink

2021-10-24 Thread GitBox


yihua opened a new pull request #3857:
URL: https://github.com/apache/hudi/pull/3857


   ## What is the purpose of the pull request
   
   This PR adds the functionality of clustering and compaction in Kafka Connect 
Sink for Hudi.
   
   ## Brief change log
   
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3802:
URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747


   
   ## CI report:
   
   * b63edfaca889ac6444b61a525cc9ee1065f610db Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2824)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2077) Flaky test: TestHoodieDeltaStreamer

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2077:
-
Priority: Critical  (was: Major)

> Flaky test: TestHoodieDeltaStreamer
> ---
>
> Key: HUDI-2077
> URL: https://issues.apache.org/jira/browse/HUDI-2077
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Attachments: 28.txt, hudi_2077_schema_mismatch.txt
>
>
> {code:java}
> [INFO] Results:
> [ERROR] Errors:
> [ERROR]   TestHoodieDeltaStreamer.testUpsertsMORContinuousModeWithMultipleWriters:716->testUpsertsContinuousModeWithMultipleWriters:831->runJobsInParallel:940 » Execution
> {code}
> Search "testUpsertsMORContinuousModeWithMultipleWriters" in the log file for details.
> {quote}
> 1730667 [pool-1461-thread-1] WARN org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer - Got error:
> org.apache.hudi.exception.HoodieIOException: Could not check if hdfs://localhost:4/user/vsts/continuous_mor_mulitwriter is a valid table
>  at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
>  at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:112)
>  at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:73)
>  at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:606)
>  at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.assertAtleastNDeltaCommitsAfterCommit(TestHoodieDeltaStreamer.java:322)
>  at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer.lambda$runJobsInParallel$8(TestHoodieDeltaStreamer.java:906)
>  at org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer$TestHelpers.lambda$waitTillCondition$0(TestHoodieDeltaStreamer.java:347)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Call From fv-az238-328/10.1.0.24 to localhost:4 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: [http://wiki.apache.org/hadoop/ConnectionRefused]
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1706:
-
Priority: Major  (was: Blocker)

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2614) Remove duplicated hadoop-hdfs with tests classifier exists in bundles

2021-10-24 Thread vinoyang (Jira)
vinoyang created HUDI-2614:
--

 Summary: Remove duplicated hadoop-hdfs with tests classifier 
exists in bundles
 Key: HUDI-2614
 URL: https://issues.apache.org/jira/browse/HUDI-2614
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: vinoyang
Assignee: vinoyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2600) Remove duplicated hadoop-common with tests classifier exists in bundles

2021-10-24 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-2600:
---
Fix Version/s: 0.10.0

> Remove duplicated hadoop-common with tests classifier exists in bundles
> ---
>
> Key: HUDI-2600
> URL: https://issues.apache.org/jira/browse/HUDI-2600
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> We found many duplicated dependencies in the generated dependency list; 
> `hadoop-common` is one of them:
> {code:java}
> hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar
> hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-2600) Remove duplicated hadoop-common with tests classifier exists in bundles

2021-10-24 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-2600.
--
Resolution: Done

220bf6a7e6f5cdf0efbbbee9df6852a8b2288570

> Remove duplicated hadoop-common with tests classifier exists in bundles
> ---
>
> Key: HUDI-2600
> URL: https://issues.apache.org/jira/browse/HUDI-2600
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> We found many duplicated dependencies in the generated dependency list; 
> `hadoop-common` is one of them:
> {code:java}
> hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar
> hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847)

2021-10-24 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 220bf6a  [HUDI-2600] Remove duplicated hadoop-common with tests 
classifier exists in bundles (#3847)
220bf6a is described below

commit 220bf6a7e6f5cdf0efbbbee9df6852a8b2288570
Author: vinoyang 
AuthorDate: Mon Oct 25 13:45:28 2021 +0800

[HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in 
bundles (#3847)
---
 dependencies/hudi-flink-bundle_2.11.txt  | 6 +++---
 dependencies/hudi-hive-sync-bundle.txt   | 7 +--
 dependencies/hudi-kafka-connect-bundle.txt   | 3 +--
 dependencies/hudi-spark-bundle_2.11.txt  | 3 +--
 dependencies/hudi-timeline-server-bundle.txt | 1 -
 dependencies/hudi-utilities-bundle_2.11.txt  | 3 +--
 hudi-client/hudi-client-common/pom.xml   | 1 +
 hudi-sync/hudi-hive-sync/pom.xml | 1 +
 hudi-timeline-service/pom.xml| 1 +
 9 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/dependencies/hudi-flink-bundle_2.11.txt 
b/dependencies/hudi-flink-bundle_2.11.txt
index b97995c..4414594 100644
--- a/dependencies/hudi-flink-bundle_2.11.txt
+++ b/dependencies/hudi-flink-bundle_2.11.txt
@@ -64,7 +64,7 @@ commons-lang/commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/org.apache.commons/3.1//commons-lang3-3.1.jar
 commons-logging/commons-logging/1.2//commons-logging-1.2.jar
 commons-math/org.apache.commons/2.2//commons-math-2.2.jar
-commons-math3/org.apache.commons/3.1.1//commons-math3-3.1.1.jar
+commons-math3/org.apache.commons/3.5//commons-math3-3.5.jar
 commons-net/commons-net/3.1//commons-net-3.1.jar
 commons-pool/commons-pool/1.6//commons-pool-1.6.jar
 config/com.typesafe/1.3.3//config-1.3.3.jar
@@ -107,6 +107,7 @@ 
force-shading/org.apache.flink/1.13.1//force-shading-1.13.1.jar
 grizzled-slf4j_2.11/org.clapper/1.3.2//grizzled-slf4j_2.11-1.3.2.jar
 groovy-all/org.codehaus.groovy/2.4.4//groovy-all-2.4.4.jar
 gson/com.google.code.gson/2.3.1//gson-2.3.1.jar
+guava/com.google.guava/12.0.1//guava-12.0.1.jar
 
guice-assistedinject/com.google.inject.extensions/3.0//guice-assistedinject-3.0.jar
 guice-servlet/com.google.inject.extensions/3.0//guice-servlet-3.0.jar
 guice/com.google.inject/3.0//guice-3.0.jar
@@ -114,7 +115,6 @@ 
hadoop-annotations/org.apache.hadoop/2.7.3//hadoop-annotations-2.7.3.jar
 hadoop-auth/org.apache.hadoop/2.7.3//hadoop-auth-2.7.3.jar
 hadoop-client/org.apache.hadoop/2.7.3//hadoop-client-2.7.3.jar
 hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar
-hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar
 hadoop-hdfs/org.apache.hadoop/2.7.3//hadoop-hdfs-2.7.3.jar
 hadoop-hdfs/org.apache.hadoop/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar
 
hadoop-mapreduce-client-app/org.apache.hadoop/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar
@@ -132,7 +132,7 @@ 
hadoop-yarn-server-resourcemanager/org.apache.hadoop/2.7.2//hadoop-yarn-server-r
 
hadoop-yarn-server-web-proxy/org.apache.hadoop/2.7.2//hadoop-yarn-server-web-proxy-2.7.2.jar
 hamcrest-core/org.hamcrest/1.3//hamcrest-core-1.3.jar
 hbase-annotations/org.apache.hbase/1.2.3//hbase-annotations-1.2.3.jar
-hbase-client/org.apache.hbase/1.1.1//hbase-client-1.1.1.jar
+hbase-client/org.apache.hbase/1.2.3//hbase-client-1.2.3.jar
 hbase-common/org.apache.hbase/1.2.3//hbase-common-1.2.3.jar
 hbase-common/org.apache.hbase/1.2.3/tests/hbase-common-1.2.3-tests.jar
 hbase-hadoop-compat/org.apache.hbase/1.2.3//hbase-hadoop-compat-1.2.3.jar
diff --git a/dependencies/hudi-hive-sync-bundle.txt 
b/dependencies/hudi-hive-sync-bundle.txt
index aefcfbb..f80ee31 100644
--- a/dependencies/hudi-hive-sync-bundle.txt
+++ b/dependencies/hudi-hive-sync-bundle.txt
@@ -56,7 +56,6 @@ 
hadoop-annotations/org.apache.hadoop/2.7.3//hadoop-annotations-2.7.3.jar
 hadoop-auth/org.apache.hadoop/2.7.3//hadoop-auth-2.7.3.jar
 hadoop-client/org.apache.hadoop/2.7.3//hadoop-client-2.7.3.jar
 hadoop-common/org.apache.hadoop/2.7.3//hadoop-common-2.7.3.jar
-hadoop-common/org.apache.hadoop/2.7.3/tests/hadoop-common-2.7.3-tests.jar
 hadoop-hdfs/org.apache.hadoop/2.7.3//hadoop-hdfs-2.7.3.jar
 hadoop-hdfs/org.apache.hadoop/2.7.3/tests/hadoop-hdfs-2.7.3-tests.jar
 
hadoop-mapreduce-client-app/org.apache.hadoop/2.7.3//hadoop-mapreduce-client-app-2.7.3.jar
@@ -87,9 +86,7 @@ 
jackson-annotations/com.fasterxml.jackson.core/2.6.7//jackson-annotations-2.6.7.
 jackson-core-asl/org.codehaus.jackson/1.9.13//jackson-core-asl-1.9.13.jar
 jackson-core/com.fasterxml.jackson.core/2.6.7//jackson-core-2.6.7.jar
 
jackson-databind/com.fasterxml.jackson.core/2.6.7.3//jackson-databind-2.6.7.3.jar
-jackson-jaxrs/org.codehaus.jackson/1.9.13//jackson-jaxrs-1.9.13.jar
 jackson-mapper-asl/org.codehaus.jackson/1.9.13//jackson-mapper-asl-1.9.13.jar
-jackson-xc/org.codehaus.jackson/1.9.13//jackson-xc-1.9.13.jar
 jamon-runtime/org.jamon/2.4.1//jamon-runti

[GitHub] [hudi] yanghua merged pull request #3847: [HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles

2021-10-24 Thread GitBox


yanghua merged pull request #3847:
URL: https://github.com/apache/hudi/pull/3847


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table

2021-10-24 Thread GitBox


nsivabalan commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r735269130



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
##
@@ -200,8 +201,49 @@ protected BaseTableMetadata(HoodieEngineContext 
engineContext, HoodieMetadataCon
 return statuses;
   }
 
  Map<String, FileStatus[]> fetchAllFilesInPartitionPaths(List<Path> partitionPaths) throws IOException {

Review comment:
   All of our tests in TestHoodieBackedMetadata use HoodieBackedTableMetadata for 
assertions. I already tried writing unit tests, and felt it's already covered and 
hence did not write one explicitly. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table

2021-10-24 Thread GitBox


nsivabalan commented on pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#issuecomment-950546155


   @prashantwason : Can you review the patch please when you get time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table

2021-10-24 Thread GitBox


nsivabalan commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r735268438



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -120,65 +120,114 @@ private void initIfNeeded() {
   }
 
   @Override
-  protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKeyFromMetadata(String key, String partitionName) {
-    Pair readers = openReadersIfNeeded(key, partitionName);
+  protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKey(String key, String partitionName) {
+    return getRecordsByKeys(Collections.singletonList(key), partitionName).get(0).getValue();
+  }
+
+  protected List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys, String partitionName) {

Review comment:
   1. I see your point about making inline vs. full scan configurable per 
metadata partition. I will address this. I guess FILES will do a full scan, 
while col_stats (min/max stats), bloom_filter, and record index will do inline 
reads; see the sketch below.
   2. If we go with (1), I'm not sure we need to try out (2) as well.
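
   For illustration, point (1) could be wired up roughly like this (a sketch with assumed partition names and a hypothetical helper, not the final implementation):

```java
// Sketch: choose the metadata read path per partition instead of one global flag.
private boolean useFullScan(String metadataPartitionName) {
  switch (metadataPartitionName) {
    case "files":          // small partition, read as a whole -> full scan
      return true;
    case "column_stats":   // point lookups by key -> inline, seek-based reads
    case "bloom_filters":
    case "record_index":
      return false;
    default:
      return true;         // conservative default
  }
}
```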
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #3827: [HUDI-2573] Fixing double locking with multi-writers

2021-10-24 Thread GitBox


nsivabalan commented on pull request #3827:
URL: https://github.com/apache/hudi/pull/3827#issuecomment-950539640


   @manojpec : thanks for your inputs. I do like the idea of TransactionManager 
handling the locking depending on whether the lock acquisition is requested by 
the same owner or a different one. But I see some implementation hurdles in that. 
Let me see how I can go about it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan merged pull request #3757: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader

2021-10-24 Thread GitBox


nsivabalan merged pull request #3757:
URL: https://github.com/apache/hudi/pull/3757


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757)

2021-10-24 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1bb0532  [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader 
(#3757)
1bb0532 is described below

commit 1bb05325637740498cac548872cf7223e34950d0
Author: Sivabalan Narayanan 
AuthorDate: Mon Oct 25 01:21:08 2021 -0400

[HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader (#3757)
---
 .../hudi/common/table/log/HoodieLogFileReader.java | 12 ++--
 .../apache/hudi/common/table/log/HoodieLogFormat.java  |  2 +-
 .../hudi/common/table/log/HoodieLogFormatReader.java   |  4 ++--
 .../apache/hudi/common/table/log/LogReaderUtils.java   | 18 +++---
 .../hudi/metadata/HoodieMetadataFileSystemView.java|  2 +-
 .../hadoop/realtime/AbstractRealtimeRecordReader.java  |  2 +-
 .../hudi/hadoop/realtime/HoodieRealtimeFileSplit.java  | 12 ++--
 .../realtime/RealtimeBootstrapBaseFileSplit.java   | 13 +++--
 .../org/apache/hudi/hadoop/realtime/RealtimeSplit.java |  3 +++
 .../hadoop/utils/HoodieRealtimeInputFormatUtils.java   |  5 +++--
 .../hadoop/realtime/TestHoodieRealtimeFileSplit.java   |  5 -
 .../realtime/TestHoodieRealtimeRecordReader.java   | 17 +
 12 files changed, 62 insertions(+), 33 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
index f0f3842..88b7e32 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
@@ -74,6 +74,11 @@ public class HoodieLogFileReader implements 
HoodieLogFormat.Reader {
   private transient Thread shutdownThread = null;
 
   public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema 
readerSchema, int bufferSize,
+ boolean readBlockLazily) throws IOException {
+this(fs, logFile, readerSchema, bufferSize, readBlockLazily, false);
+  }
+
+  public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema 
readerSchema, int bufferSize,
  boolean readBlockLazily, boolean reverseReader) 
throws IOException {
 FSDataInputStream fsDataInputStream = fs.open(logFile.getPath(), 
bufferSize);
 this.logFile = logFile;
@@ -82,16 +87,11 @@ public class HoodieLogFileReader implements 
HoodieLogFormat.Reader {
 this.readBlockLazily = readBlockLazily;
 this.reverseReader = reverseReader;
 if (this.reverseReader) {
-  this.reverseLogFilePosition = this.lastReverseLogFilePosition = 
fs.getFileStatus(logFile.getPath()).getLen();
+  this.reverseLogFilePosition = this.lastReverseLogFilePosition = 
logFile.getFileSize();
 }
 addShutDownHook();
   }
 
-  public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema 
readerSchema, boolean readBlockLazily,
-  boolean reverseReader) throws IOException {
-this(fs, logFile, readerSchema, DEFAULT_BUFFER_SIZE, readBlockLazily, 
reverseReader);
-  }
-
   public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema 
readerSchema) throws IOException {
 this(fs, logFile, readerSchema, DEFAULT_BUFFER_SIZE, false, false);
   }
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
index c566788..569b4a2 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
@@ -274,7 +274,7 @@ public interface HoodieLogFormat {
 
   static HoodieLogFormat.Reader newReader(FileSystem fs, HoodieLogFile 
logFile, Schema readerSchema)
   throws IOException {
-return new HoodieLogFileReader(fs, logFile, readerSchema, 
HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false, false);
+return new HoodieLogFileReader(fs, logFile, readerSchema, 
HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false);
   }
 
   static HoodieLogFormat.Reader newReader(FileSystem fs, HoodieLogFile 
logFile, Schema readerSchema,
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
index 7267227..e64e1a1 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
@@ -59,7 +59,7 @@ public class HoodieLogFormatReader implements 
HoodieLogFormat.Reader {
 this.prevReadersInOpenState = new ArrayList<>();
 if (logFiles.size() > 0) {
   HoodieLogFile n
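
The gist of the change above: stop issuing `fs.getFileStatus(...)` calls when the needed metadata is already carried by the `HoodieLogFile`. A minimal sketch of that pattern, with illustrative class names rather than the actual Hudi types:

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Illustrative handle, not the real HoodieLogFile: capture file metadata once
// (e.g. from a directory listing) so later reads need no extra fs round trip.
class LogFileHandle {
  private final Path path;
  private final long fileSize;

  LogFileHandle(FileStatus status) {
    this.path = status.getPath();
    this.fileSize = status.getLen();
  }

  // Before this change: fs.getFileStatus(path).getLen() -- one more RPC per call.
  // After: the cached size is returned directly.
  long getFileSize() {
    return fileSize;
  }

  Path getPath() {
    return path;
  }
}
```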

[GitHub] [hudi] nsivabalan commented on a change in pull request #3757: [HUDI-2005] Avoiding direct fs calls in HoodieLogFileReader

2021-10-24 Thread GitBox


nsivabalan commented on a change in pull request #3757:
URL: https://github.com/apache/hudi/pull/3757#discussion_r735263839



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeSplit.java
##
@@ -41,6 +42,8 @@
*/
   List getDeltaLogPaths();

Review comment:
   yes, will take it up as a [follow-up](https://issues.apache.org/jira/browse/HUDI-2613).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2021-10-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-2613:
-

Assignee: sivabalan narayanan

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2021-10-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2613:
--
Parent: HUDI-1292
Issue Type: Sub-task  (was: Improvement)

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Major
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2021-10-24 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2613:
--
Fix Version/s: 0.10.0

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.10.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2021-10-24 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-2613:
-

 Summary: Fix usages of RealtimeSplit to use the new 
getDeltaLogFileStatus
 Key: HUDI-2613
 URL: https://issues.apache.org/jira/browse/HUDI-2613
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: sivabalan narayanan


Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
getDeltalogs()
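
A rough sketch of the intended migration, assuming Hadoop's `FileStatus` API on the Hudi side (the method name is taken from this ticket; the surrounding types are illustrative):

```java
import org.apache.hadoop.fs.FileStatus;

import java.util.List;
import java.util.stream.Collectors;

final class DeltaLogMigrationSketch {
  // Callers that previously consumed plain path strings from getDeltaLogPaths()
  // can derive them from the FileStatus objects returned by
  // getDeltaLogFileStatus(), keeping size/modification-time metadata available
  // without additional fs calls.
  static List<String> toPathStrings(List<FileStatus> deltaLogFileStatuses) {
    return deltaLogFileStatuses.stream()
        .map(status -> status.getPath().toString())
        .collect(Collectors.toList());
  }
}
```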



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3802:
URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747


   
   ## CI report:
   
   * e906d363c06635bbcc7c69db5fcc4ff0f0f2d919 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2741)
 
   * b63edfaca889ac6444b61a525cc9ee1065f610db Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2824)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3802: [HUDI-1500] Support replace commit in DeltaSync with commit metadata preserved

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3802:
URL: https://github.com/apache/hudi/pull/3802#issuecomment-943342747


   
   ## CI report:
   
   * e906d363c06635bbcc7c69db5fcc4ff0f0f2d919 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2741)
 
   * b63edfaca889ac6444b61a525cc9ee1065f610db UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402


   
   ## CI report:
   
   * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2822)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)

2021-10-24 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d856037  [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)
d856037 is described below

commit d8560377c306e49b7e58448b6897e9c0e7719f61
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Sun Oct 24 21:14:39 2021 -0700

[HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter (#3849)

Remove the logic of using deltastreamer to prep test table. Use fixture 
(compressed test table) instead.
---
 .../SparkClientFunctionalTestHarness.java  |   8 +-
 .../hudi/testutils/providers/SparkProvider.java|   4 +-
 .../apache/hudi/common/testutils/FixtureUtils.java |  81 +
 .../common/testutils/HoodieTestDataGenerator.java  |   5 +-
 .../functional/TestHoodieDeltaStreamer.java|  12 +--
 .../TestHoodieDeltaStreamerWithMultiWriter.java|  96 ++---
 .../functional/TestJdbcbasedSchemaProvider.java|  11 ++-
 .../testutils/sources/AbstractBaseTestSource.java  |  26 +-
 ...inuousModeWithMultipleWriters.COPY_ON_WRITE.zip | Bin 0 -> 2494616 bytes
 ...inuousModeWithMultipleWriters.MERGE_ON_READ.zip | Bin 0 -> 2910151 bytes
 10 files changed, 178 insertions(+), 65 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
index 74ab52d..aca1d83 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
@@ -176,8 +176,14 @@ public class SparkClientFunctionalTestHarness implements 
SparkProvider, HoodieMe
 }
   }
 
+  /**
+   * To clean up Spark resources after all testcases have run in functional 
tests.
+   *
+   * Spark session and contexts were reused for testcases in the same test 
class. Some
+   * testcase may invoke this specifically to clean up in case of repeated 
test runs.
+   */
   @AfterAll
-  public static synchronized void cleanUpAfterAll() {
+  public static synchronized void resetSpark() {
 if (spark != null) {
   spark.close();
   spark = null;
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java
index be15dc8..92b1f76 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/providers/SparkProvider.java
@@ -39,6 +39,8 @@ public interface SparkProvider extends 
org.apache.hudi.testutils.providers.Hoodi
 SparkConf sparkConf = new SparkConf();
 sparkConf.set("spark.app.name", getClass().getName());
 sparkConf.set("spark.master", "local[*]");
+sparkConf.set("spark.default.parallelism", "4");
+sparkConf.set("spark.sql.shuffle.partitions", "4");
 sparkConf.set("spark.driver.maxResultSize", "2g");
 sparkConf.set("spark.hadoop.mapred.output.compress", "true");
 sparkConf.set("spark.hadoop.mapred.output.compression.codec", "true");
@@ -52,4 +54,4 @@ public interface SparkProvider extends 
org.apache.hudi.testutils.providers.Hoodi
   default SparkConf conf() {
 return conf(Collections.emptyMap());
   }
-}
\ No newline at end of file
+}
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java
new file mode 100644
index 000..6dfe0da
--- /dev/null
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/FixtureUtils.java
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import java.io.File;
+import java.io.FileInputStream;
+
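
As a rough idea of what extracting a zipped fixture table into a test directory involves, a generic JDK-only sketch (not the actual Hudi FixtureUtils implementation):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class UnzipSketch {
  // Extract a zipped test fixture into targetDir, preserving directory layout.
  static void unzip(InputStream zippedFixture, Path targetDir) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(zippedFixture)) {
      ZipEntry entry;
      while ((entry = zis.getNextEntry()) != null) {
        Path out = targetDir.resolve(entry.getName()).normalize();
        // Guard against zip entries that try to escape the target directory.
        if (!out.startsWith(targetDir)) {
          throw new IOException("Zip entry escapes target dir: " + entry.getName());
        }
        if (entry.isDirectory()) {
          Files.createDirectories(out);
        } else {
          Files.createDirectories(out.getParent());
          Files.copy(zis, out, StandardCopyOption.REPLACE_EXISTING);
        }
        zis.closeEntry();
      }
    }
  }
}
```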

[GitHub] [hudi] xushiyan merged pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


xushiyan merged pull request #3849:
URL: https://github.com/apache/hudi/pull/3849


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


xushiyan commented on pull request #3849:
URL: https://github.com/apache/hudi/pull/3849#issuecomment-950511022


   Build passed 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=2820&view=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Cherry-Puppy removed a comment on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql

2021-10-24 Thread GitBox


Cherry-Puppy removed a comment on issue #3680:
URL: https://github.com/apache/hudi/issues/3680#issuecomment-950498398


   I also encountered this problem. I still can't find this class after 
changing the hive version. But there is this class in the jar package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Cherry-Puppy commented on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql

2021-10-24 Thread GitBox


Cherry-Puppy commented on issue #3680:
URL: https://github.com/apache/hudi/issues/3680#issuecomment-950503446


   @danny0405 I also encountered this problem. I still can't find this class 
after changing the hive version. But there is this class in the jar package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Cherry-Puppy commented on issue #3680: [SUPPORT]Failed to sync data to hive-3.1.2 by flink-sql

2021-10-24 Thread GitBox


Cherry-Puppy commented on issue #3680:
URL: https://github.com/apache/hudi/issues/3680#issuecomment-950498398


   I also encountered this problem. I still can't find this class after 
changing the hive version. But there is this class in the jar package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402


   
   ## CI report:
   
   * 822dbe03dc77531858ffd83ebeb91f210f4e7851 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2818)
 
   * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2822)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-944948402


   
   ## CI report:
   
   * 822dbe03dc77531858ffd83ebeb91f210f4e7851 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2818)
 
   * 7a7ee072ae225fe015b73545ac8d50acc5746ea7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571


   
   ## CI report:
   
   * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN
   * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] boneanxs opened a new issue #3856: [SUPPORT] Maybe should cache baseDir in nonHoodiePathCache in HoodieROTablePathFilter?

2021-10-24 Thread GitBox


boneanxs opened a new issue #3856:
URL: https://github.com/apache/hudi/issues/3856


   For a non-Hudi table with table path `hdfs://test/warehouse/db/table` and 3 partition columns (p1, p2, p3), the path for a specific partition like (p1=A, p2=B, p3=C) is `hdfs://test/warehouse/db/table/p1=A/p2=B/p3=C`. HoodieROTablePathFilter checks whether the baseDir (hdfs://test/warehouse/db/table) is a valid Hoodie table path or not; otherwise, it caches `hdfs://test/warehouse/db/table/p1=A/p2=B/p3=C` in nonHoodiePathCache.
   
   I'm wondering why we don't cache the baseDir in nonHoodiePathCache. If we cached the baseDir, then for other partitions (like p1=A1, p2=B1, p3=C1) we would only need to check whether the baseDir is in nonHoodiePathCache, as sketched below.
   
   Please correct me if I'm wrong.
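
   A minimal sketch of the suggested base-dir caching (field and method names here are hypothetical, not the actual HoodieROTablePathFilter internals):

```java
import org.apache.hadoop.fs.Path;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class BaseDirCacheSketch {
  // Hypothetical cache of table base dirs already known to be non-Hoodie.
  private final Set<String> nonHoodieBaseDirCache = ConcurrentHashMap.newKeySet();

  boolean accept(Path partitionPath, Path baseDir) {
    // If the base dir is cached, every partition under it is accepted
    // without re-checking .hoodie metadata per partition path.
    if (nonHoodieBaseDirCache.contains(baseDir.toString())) {
      return true;
    }
    boolean isHoodieTable = checkForHoodieMetadata(baseDir); // placeholder check
    if (!isHoodieTable) {
      nonHoodieBaseDirCache.add(baseDir.toString());
    }
    return !isHoodieTable || acceptHoodiePath(partitionPath); // placeholder filter
  }

  private boolean checkForHoodieMetadata(Path baseDir) { return false; }

  private boolean acceptHoodiePath(Path p) { return true; }
}
```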


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun commented on a change in pull request #3700: [HUDI-2471] Add support ignoring case when column name matches in merge into

2021-10-24 Thread GitBox


dongkelun commented on a change in pull request #3700:
URL: https://github.com/apache/hudi/pull/3700#discussion_r735231887



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala
##
@@ -163,15 +163,15 @@ case class HoodieResolveReferences(sparkSession: 
SparkSession) extends Rule[Logi
   // assignments is empty means insert * or update set *
   val resolvedSourceOutputWithoutMetaFields = 
resolvedSource.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name))
   val targetOutputWithoutMetaFields = target.output.filter(attr => 
!HoodieSqlUtils.isMetaField(attr.name))
-  val resolvedSourceColumnNamesWithoutMetaFields = 
resolvedSourceOutputWithoutMetaFields.map(_.name)
-  val targetColumnNamesWithoutMetaFields = 
targetOutputWithoutMetaFields.map(_.name)
+  val resolvedSourceColumnNamesWithoutMetaFields = 
resolvedSourceOutputWithoutMetaFields.map(_.name.toLowerCase)

Review comment:
   @YannByron yes, it works for column name matching. Should I add a test case for column names defined in upper case?
   However, case-insensitive matching has not been implemented for the condition and the actions yet. I think we should support it. Should I submit another PR, or support it in this PR?
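
   For reference, the approach in the diff above normalizes both sides to lower case before matching. A standalone sketch of that idea, in plain Java outside the actual Spark analysis rule:

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class CaseInsensitiveMatchSketch {
  // True when every target column has a source column that matches ignoring
  // case, mirroring the .map(_.name.toLowerCase) normalization in the diff.
  static boolean allColumnsResolvable(List<String> sourceCols, List<String> targetCols) {
    Set<String> normalizedSource = sourceCols.stream()
        .map(c -> c.toLowerCase(Locale.ROOT))
        .collect(Collectors.toSet());
    return targetCols.stream()
        .allMatch(c -> normalizedSource.contains(c.toLowerCase(Locale.ROOT)));
  }
}
```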




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-2612) No need to define primary key for flink insert operation

2021-10-24 Thread Danny Chen (Jira)
Danny Chen created HUDI-2612:


 Summary: No need to define primary key for flink insert operation
 Key: HUDI-2612
 URL: https://issues.apache.org/jira/browse/HUDI-2612
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Flink Integration
Reporter: Danny Chen
 Fix For: 0.10.0


There is one exception: the MOR table may still need the pk to generate {{HoodieKey}} for #preCombine and compaction merge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi

2021-10-24 Thread GitBox


vinothchandar commented on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-950475158


   Thanks for your patience. Definitely on it. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #3700: [HUDI-2471] Add support ignoring case when column name matches in merge into

2021-10-24 Thread GitBox


YannByron commented on a change in pull request #3700:
URL: https://github.com/apache/hudi/pull/3700#discussion_r735220808



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala
##
@@ -163,15 +163,15 @@ case class HoodieResolveReferences(sparkSession: 
SparkSession) extends Rule[Logi
   // assignments is empty means insert * or update set *
   val resolvedSourceOutputWithoutMetaFields = 
resolvedSource.output.filter(attr => !HoodieSqlUtils.isMetaField(attr.name))
   val targetOutputWithoutMetaFields = target.output.filter(attr => 
!HoodieSqlUtils.isMetaField(attr.name))
-  val resolvedSourceColumnNamesWithoutMetaFields = 
resolvedSourceOutputWithoutMetaFields.map(_.name)
-  val targetColumnNamesWithoutMetaFields = 
targetOutputWithoutMetaFields.map(_.name)
+  val resolvedSourceColumnNamesWithoutMetaFields = 
resolvedSourceOutputWithoutMetaFields.map(_.name.toLowerCase)

Review comment:
   If the table field is defined in uppercase letters, does that work?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571


   
   ## CI report:
   
   * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN
   * 8236ece4816e100af13702bf92fdddf9c5e14eaf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2428)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2430)
 
   * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2821)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi

2021-10-24 Thread GitBox


xiarixiaoyao commented on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-950471560


   @vinothchandar already rebased the code. Could you help me review it? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3330: [HUDI-2101][RFC-28]support z-order for hudi

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-885350571


   
   ## CI report:
   
   * 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN
   * 8236ece4816e100af13702bf92fdddf9c5e14eaf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2428)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2430)
 
   * e555754a4ea179e5251cd7bbff7e8d20c02ef7c8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3849:
URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934


   
   ## CI report:
   
   * f623c7545b41a70eb607d530428536567e70fb7a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2820)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3854: [SUPPORT] Lower performance using 0.9.0 vs 0.8.0

2021-10-24 Thread GitBox


xushiyan commented on issue #3854:
URL: https://github.com/apache/hudi/issues/3854#issuecomment-950455798


   @Limess thanks for providing benchmarks! 
   
   > bulk inserts are slightly faster with Hudi 0.9.0
   
   This is most likely due to the row writer being enabled by default in 0.9.0 
(https://hudi.apache.org/docs/configurations#hoodiedatasourcewriterowwriterenable), 
which boosts bulk insert in 0.9.
   
   @nsivabalan do you have any hints on what changes in 0.9 might cause the 
slower writes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3849:
URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934


   
   ## CI report:
   
   * 16b061c5fa2b1d77755913cb6bda1025c4baf526 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2815)
 
   * f623c7545b41a70eb607d530428536567e70fb7a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2820)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


hudi-bot edited a comment on pull request #3849:
URL: https://github.com/apache/hudi/pull/3849#issuecomment-950068934


   
   ## CI report:
   
   * 16b061c5fa2b1d77755913cb6bda1025c4baf526 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2815)
 
   * f623c7545b41a70eb607d530428536567e70fb7a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan closed issue #3845: [SUPPORT]`if not exists` doesn't work on create table in spark-sql

2021-10-24 Thread GitBox


xushiyan closed issue #3845:
URL: https://github.com/apache/hudi/issues/3845


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3845: [SUPPORT]`if not exists` doesn't work on create table in spark-sql

2021-10-24 Thread GitBox


xushiyan commented on issue #3845:
URL: https://github.com/apache/hudi/issues/3845#issuecomment-950442738


   @mutoulbj @BenjMaq Thanks for raising this! It does make sense to print a message indicating the table exists instead of erroring. Filing a JIRA; please feel free to take it if you're interested!
   
   https://issues.apache.org/jira/browse/HUDI-2611


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-2611) `create table if not exists` should print message instead of throwing error

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2611:


 Summary: `create table if not exists` should print message instead 
of throwing error
 Key: HUDI-2611
 URL: https://issues.apache.org/jira/browse/HUDI-2611
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Raymond Xu


See details in

https://github.com/apache/hudi/issues/3845#issue-1033218877



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan closed issue #3662: [SUPPORT] Error on the spark version in the desc information of the hudi CTAS Table

2021-10-24 Thread GitBox


xushiyan closed issue #3662:
URL: https://github.com/apache/hudi/issues/3662


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3662: [SUPPORT] Error on the spark version in the desc information of the hudi CTAS Table

2021-10-24 Thread GitBox


xushiyan commented on issue #3662:
URL: https://github.com/apache/hudi/issues/3662#issuecomment-950438362


   @kelvin-qin thanks for reproducing this! I see it's not the right Spark version info when doing CTAS from a Hudi table; the version info is not propagated correctly. I can also reproduce it. It'd be a nice fix. Filing a JIRA now. If you're keen, please feel free to take it.
   https://issues.apache.org/jira/browse/HUDI-2610


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-2610) Fix Spark version info for hudi table CTAS from another hudi table

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2610:


 Summary: Fix Spark version info for hudi table CTAS from another 
hudi table
 Key: HUDI-2610
 URL: https://issues.apache.org/jira/browse/HUDI-2610
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Raymond Xu


See details in the original issue

 

https://github.com/apache/hudi/issues/3662#issuecomment-938489457



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #3849: [HUDI-2077] Fix TestHoodieDeltaStreamerWithMultiWriter

2021-10-24 Thread GitBox


nsivabalan commented on a change in pull request #3849:
URL: https://github.com/apache/hudi/pull/3849#discussion_r735196225



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamerWithMultiWriter.java
##
@@ -254,6 +254,16 @@ private static TypedProperties 
prepareMultiWriterProps(FileSystem fs, String bas
 return cfg;
   }
 
+  /**
+   * Specifically used for {@link TestHoodieDeltaStreamerWithMultiWriter}.
+   *
+   * The fixture test tables have random records generated by
+   * {@link org.apache.hudi.common.testutils.HoodieTestDataGenerator} using
+   * {@link 
org.apache.hudi.common.testutils.HoodieTestDataGenerator#TRIP_EXAMPLE_SCHEMA}.
+   *
+   * The COW fixture test table has 3000 unique records in 7 compaction 
commits.

Review comment:
   COW can't have any compaction commits. Did you mean just regular 
commits? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan closed issue #3392: [SUPPORT] Compile hudi master with hive version 2.1.1 error

2021-10-24 Thread GitBox


xushiyan closed issue #3392:
URL: https://github.com/apache/hudi/issues/3392


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3392: [SUPPORT] Compile hudi master with hive version 2.1.1 error

2021-10-24 Thread GitBox


xushiyan commented on issue #3392:
URL: https://github.com/apache/hudi/issues/3392#issuecomment-950421639


   Closing due to inactivity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3760: [SUPPORT] Pushing hoodie metrics to prometheus having error

2021-10-24 Thread GitBox


xushiyan commented on issue #3760:
URL: https://github.com/apache/hudi/issues/3760#issuecomment-950421068


   > I think spark never try to write to prometheus, even if I put a wrong 
address, no error.
   
   @rubenssoto can you share your settings? @liujinhui1994 could you give any suggestions or hints on the prometheus problems above?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3760: [SUPPORT] Pushing hoodie metrics to prometheus having error

2021-10-24 Thread GitBox


xushiyan commented on issue #3760:
URL: https://github.com/apache/hudi/issues/3760#issuecomment-950420645


   @data-storyteller @rubenssoto can you check out this guide prepared by @nsivabalan (to be merged to the website) and see if the instructions help? 
https://github.com/apache/hudi/commit/959bd6eef8c90c11616840f975ef40a46222a913?short_path=aff66ea#diff-aff66ea1c34953a024c85c6e2fe86b8521b6cd3d623377a96d8d79c6caa8de13
   
   @data-storyteller 
   
   ```
   Exception in thread "main" java.lang.NoSuchMethodError: 'void 
io.prometheus.client.dropwizard.DropwizardExports.<init>(org.apache.hudi.com.codahale.metrics.MetricRegistry)'
   ```
   
   Looks like it's a jar issue. Are you using the hudi bundle jar? Can you print your classpath too?
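
   For anyone trying to reproduce a working setup, a minimal sketch of enabling the Prometheus pushgateway reporter on a datasource write. Config keys are as documented for Hudi 0.9; the host, port, and table name are placeholders, and this is not a verified end-to-end configuration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

final class PrometheusMetricsSketch {
  // Adds the metrics-related options to an otherwise fully configured Hudi
  // write (record key, precombine field, etc. are omitted here for brevity).
  static void writeWithMetrics(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")                              // placeholder
        .option("hoodie.metrics.on", "true")
        .option("hoodie.metrics.reporter.type", "PROMETHEUS_PUSHGATEWAY")
        .option("hoodie.metrics.pushgateway.host", "pushgateway.example.com") // placeholder
        .option("hoodie.metrics.pushgateway.port", "9091")                    // placeholder
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```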


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan closed issue #3676: MOR table rolls out new parquet files at 10MB for new inserts - even though max file size set as 128MB

2021-10-24 Thread GitBox


xushiyan closed issue #3676:
URL: https://github.com/apache/hudi/issues/3676


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3676: MOR table rolls out new parquet files at 10MB for new inserts - even though max file size set as 128MB

2021-10-24 Thread GitBox


xushiyan commented on issue #3676:
URL: https://github.com/apache/hudi/issues/3676#issuecomment-950417563


   @nsivabalan I also filed https://issues.apache.org/jira/browse/HUDI-2609 to make the docs clearer on this.
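
   Until that doc lands, a short summary of the two sizing knobs usually involved. Key names and defaults are taken from the Hudi configuration docs and should be verified against your version:

```java
// Hedged summary of the small-file sizing configs referenced in this thread.
final class SmallFileConfigSketch {
  // Base files below this size are considered "small"; upserts route new
  // records into them so they grow toward the max file size.
  static final String SMALL_FILE_LIMIT_KEY = "hoodie.parquet.small.file.limit";
  static final long SMALL_FILE_LIMIT_DEFAULT_BYTES = 104_857_600L; // 100 MB

  // Target upper bound for a base parquet file produced by a write.
  static final String MAX_FILE_SIZE_KEY = "hoodie.parquet.max.file.size";
  static final long MAX_FILE_SIZE_DEFAULT_BYTES = 125_829_120L; // 120 MB
}
```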


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2609) Clarify small file configs in config page

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2609:
-
Labels: user-support-issues  (was: )

> Clarify small file configs in config page
> -
>
> Key: HUDI-2609
> URL: https://issues.apache.org/jira/browse/HUDI-2609
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Raymond Xu
>Priority: Minor
>  Labels: user-support-issues
>
> The knowledge should be preserved in docs close to the related config keys
> https://github.com/apache/hudi/issues/3676#issuecomment-922508543



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2609) Clarify small file configs in config page

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2609:


 Summary: Clarify small file configs in config page
 Key: HUDI-2609
 URL: https://issues.apache.org/jira/browse/HUDI-2609
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Docs
Reporter: Raymond Xu


The knowledge should be preserved in docs close to the related config keys

https://github.com/apache/hudi/issues/3676#issuecomment-922508543



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2607) Reorganize Hudi docs

2021-10-24 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra reassigned HUDI-2607:
-

Assignee: Kyle Weller

> Reorganize Hudi docs
> 
>
> Key: HUDI-2607
> URL: https://issues.apache.org/jira/browse/HUDI-2607
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Minor
>  Labels: pull-request-available
>
> Reorganize Hudi docs so they are more accessible and easier to find what you 
> need.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan commented on issue #3191: [SUPPORT]client spark-submit cmd error:Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.DataSourceUtils$.PARTITIONI

2021-10-24 Thread GitBox


xushiyan commented on issue #3191:
URL: https://github.com/apache/hudi/issues/3191#issuecomment-950416601


   @xer001 `PARTITIONING_COLUMNS_KEY` is **not** present in spark 2.4.0, see 
https://jar-download.com/artifacts/org.apache.spark/spark-sql_2.11/2.4.0/source-code/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
   It was added in 2.4.2 and later.
   Can you please upgrade your spark to a newer `2.4.x` version?
   
   @mdz-doit `PARTITIONING_COLUMNS_KEY` **is** there in 2.4.5, see 
https://jar-download.com/artifacts/org.apache.spark/spark-sql_2.11/2.4.5/source-code/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
   
   Can you print your spark version from the spark shell to make sure you have the right one? Or do you have other dependencies on your classpath? You can print the whole classpath to inspect.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan edited a comment on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource

2021-10-24 Thread GitBox


xushiyan edited a comment on issue #3835:
URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484


   @shivabodepudi I see. The problem is that you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` only allows an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro by overriding `org.apache.hudi.schema.SchemaRegistryProvider#fetchSchemaFromRegistry`.
   
   Meanwhile I do think supporting JSON schema makes sense, as we support JsonSource anyway. Filing a JIRA for this: https://issues.apache.org/jira/browse/HUDI-2608. @shivabodepudi if you're interested, feel free to pick up this feature.
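
   A rough shape of that workaround, assuming the hudi-utilities SchemaProvider contract (a constructor taking TypedProperties and JavaSparkContext, and getSourceSchema() returning an Avro Schema). Both helper methods below are hypothetical stubs:

```java
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch only: fetch the JSON schema, convert it to Avro, and expose it
// through the SchemaProvider contract that DeltaStreamer consumes.
public class JsonToAvroSchemaProvider extends SchemaProvider {

  public JsonToAvroSchemaProvider(TypedProperties props, JavaSparkContext jssc) {
    super(props, jssc);
  }

  @Override
  public Schema getSourceSchema() {
    String jsonSchema = fetchJsonSchemaFromRegistry();           // hypothetical helper
    String avroSchemaJson = convertJsonSchemaToAvro(jsonSchema); // hypothetical converter
    return new Schema.Parser().parse(avroSchemaJson);
  }

  // Placeholder: a real implementation would fetch from the registry URL.
  private String fetchJsonSchemaFromRegistry() {
    return "{\"type\": \"object\", \"properties\": {}}";
  }

  // Placeholder: a real converter must map JSON-schema types to Avro types.
  private String convertJsonSchemaToAvro(String jsonSchema) {
    return "{\"type\": \"record\", \"name\": \"placeholder\", \"fields\": []}";
  }
}
```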
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan edited a comment on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource

2021-10-24 Thread GitBox


xushiyan edited a comment on issue #3835:
URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484


   @shivabodepudi I see. The problem is that only Avro schema is supported and you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` only allows an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro by overriding `org.apache.hudi.schema.SchemaRegistryProvider#fetchSchemaFromRegistry`.
   
   Meanwhile I do think supporting JSON schema makes sense, as we support JsonSource anyway. Filing a JIRA for this: https://issues.apache.org/jira/browse/HUDI-2608. @shivabodepudi if you're interested, feel free to pick up this feature.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource

2021-10-24 Thread GitBox


xushiyan commented on issue #3835:
URL: https://github.com/apache/hudi/issues/3835#issuecomment-950410484


   @shivabodepudi I see. The problem is that only Avro schema is supported and you're using a JSON schema. The schema provider `org.apache.hudi.schema.SchemaProvider` only allows an Avro schema to be provided. You could extend `org.apache.hudi.schema.SchemaRegistryProvider` to convert the JSON schema into Avro.
   
   Meanwhile I do think supporting JSON schema makes sense, as we support JsonSource anyway. Filing a JIRA for this: https://issues.apache.org/jira/browse/HUDI-2608. @shivabodepudi if you're interested, feel free to pick up this feature.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan closed issue #3835: Hudi deltastreamer using avro schema parser when using jsonKafkaSource

2021-10-24 Thread GitBox


xushiyan closed issue #3835:
URL: https://github.com/apache/hudi/issues/3835


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2608:
-
Description: 
To work with JSON kafka source.

 

Original issue

https://github.com/apache/hudi/issues/3835

> Support JSON schema in schema registry provider
> ---
>
> Key: HUDI-2608
> URL: https://issues.apache.org/jira/browse/HUDI-2608
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Major
>
> To work with JSON kafka source.
>  
> Original issue
> https://github.com/apache/hudi/issues/3835



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2608:
-
Labels: sev:normal user-support-issues  (was: )

> Support JSON schema in schema registry provider
> ---
>
> Key: HUDI-2608
> URL: https://issues.apache.org/jira/browse/HUDI-2608
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Major
>  Labels: sev:normal, user-support-issues
>
> To work with JSON kafka source.
>  
> Original issue
> https://github.com/apache/hudi/issues/3835



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2608) Support JSON schema in schema registry provider

2021-10-24 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2608:


 Summary: Support JSON schema in schema registry provider
 Key: HUDI-2608
 URL: https://issues.apache.org/jira/browse/HUDI-2608
 Project: Apache Hudi
  Issue Type: New Feature
  Components: DeltaStreamer
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851)

2021-10-24 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4814dff  [DOCS] Update azure_hoodie.md and docker_demo.md of cn doc 
(#3851)
4814dff is described below

commit 4814dff7dfc1812ba85077fc3ac1910721a81662
Author: laurieliyang <11391675+laurieliy...@users.noreply.github.com>
AuthorDate: Mon Oct 25 06:20:56 2021 +0800

[DOCS] Update azure_hoodie.md and docker_demo.md of cn doc (#3851)

* Update cn doc azure_hoodie.md of current and 0.8.0
* Remove version matter of azure_hoodie of current
---
 .../current/azure_hoodie.md|  35 ++--
 .../current/docker_demo.md | 215 +
 .../version-0.8.0/azure_hoodie.md  |  35 ++--
 3 files changed, 127 insertions(+), 158 deletions(-)

diff --git 
a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md 
b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
index cbda98a..f7ccb84 100644
--- a/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
+++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/azure_hoodie.md
@@ -1,41 +1,42 @@
 ---
-title: Azure Filesystem
+title: Azure 文件系统
 keywords: [ hudi, hive, azure, spark, presto]
-summary: In this page, we go over how to configure Hudi with Azure filesystem.
+summary: 在本页中,我们讨论如何在 Azure 文件系统中配置 Hudi 。
 last_modified_at: 2020-05-25T19:00:57-04:00
 language: cn
 ---
-In this page, we explain how to use Hudi on Microsoft Azure.
+在本页中,我们解释如何在 Microsoft Azure 上使用 Hudi 。
 
-## Disclaimer
+## 声明
 
-This page is maintained by the Hudi community.
-If the information is inaccurate or you have additional information to add.
-Please feel free to create a JIRA ticket. Contribution is highly appreciated.
+本页面由 Hudi 社区维护。
+如果信息不准确,或者你有信息要补充,请尽管创建 JIRA ticket。
+对此贡献高度赞赏。
 
-## Supported Storage System
+## 支持的存储系统
 
-There are two storage systems support Hudi .
+Hudi 支持两种存储系统。
 
-- Azure Blob Storage
+- Azure Blob 存储
 - Azure Data Lake Gen 2
 
-## Verified Combination of Spark and storage system
+## 经过验证的 Spark 与存储系统的组合
 
- HDInsight Spark2.4 on Azure Data Lake Storage Gen 2
+ Azure Data Lake Storage Gen 2 上的 HDInsight Spark 2.4
 This combination works out of the box. No extra config needed.
+这种组合开箱即用,不需要额外的配置。
 
- Databricks Spark2.4 on Azure Data Lake Storage Gen 2
-- Import Hudi jar to databricks workspace
+ Azure Data Lake Storage Gen 2 上的 Databricks Spark 2.4
+- 将 Hudi jar 包导入到 databricks 工作区 。
 
-- Mount the file system to dbutils.
+- 将文件系统挂载到 dbutils 。
   ```scala
   dbutils.fs.mount(
 source = "abfss://x...@xxx.dfs.core.windows.net",
 mountPoint = "/mountpoint",
 extraConfigs = configs)
   ```
-- When writing Hudi dataset, use abfss URL
+- 当写入 Hudi 数据集时,使用 abfss URL
   ```scala
   inputDF.write
 .format("org.apache.hudi")
@@ -43,7 +44,7 @@ This combination works out of the box. No extra config needed.
 .mode(SaveMode.Append)
 
.save("abfss://<>.dfs.core.windows.net/hudi-tables/customer")
   ```
-- When reading Hudi dataset, use the mounting point
+- 当读取 Hudi 数据集时,使用挂载点
   ```scala
   spark.read
 .format("org.apache.hudi")
diff --git 
a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md 
b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
index 3b8d1f0..eea0e88 100644
--- a/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
+++ b/website/i18n/cn/docusaurus-plugin-content-docs/current/docker_demo.md
@@ -6,18 +6,17 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 language: cn
 ---
 
-## A Demo using docker containers
+## 一个使用 Docker 容器的 Demo
 
-Lets use a real world example to see how hudi works end to end. For this 
purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your 
computer.
+我们来使用一个真实世界的案例,来看看 Hudi 是如何闭环运转的。 为了这个目的,在你的计算机中的本地 Docker 集群中组建了一个自包含的数据基础设施。
 
-The steps have been tested on a Mac laptop
+以下步骤已经在一台 Mac 笔记本电脑上测试过了。
 
-### Prerequisites
+### 前提条件
 
-  * Docker Setup :  For Mac, Please follow the steps as defined in 
[https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL 
queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See 
Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be 
killed because of memory issues.
-  * kafkacat : A command-line utility to publish/consume from kafka topics. 
Use `brew install kafkacat` to install kafkacat
-  * /etc/hosts : The demo references many services running in container by the 
hostname. Add the following settings to /etc/hosts
+  * Docker 安装 :  对于 Mac ,请依照 
[https://docs.docker.com/v17.12/docker-for-mac/install/] 当中定义的步骤。 为了运行 
Spark-SQL 查询,请确保至少分配给 Docker 6 GB 和 4 个 CPU 。(参见 Docker -> Preferences -> 
Advanced)。否则,Spark-SQL 查询可能
