[jira] [Created] (HUDI-4382) Add logger for HoodieCopyOnWriteTableInputFormat
Wenning Ding created HUDI-4382: -- Summary: Add logger for HoodieCopyOnWriteTableInputFormat Key: HUDI-4382 URL: https://issues.apache.org/jira/browse/HUDI-4382 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When querying the ro and rt bootstrap MOR tables with Presto, I observed that both fail with the following exception:
{code:java}
java.lang.NoSuchFieldError: LOG
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.makeExternalFileSplit(HoodieCopyOnWriteTableInputFormat.java:199)
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.makeSplit(HoodieCopyOnWriteTableInputFormat.java:100)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.doMakeSplitForRealtimePath(HoodieMergeOnReadTableInputFormat.java:266)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.makeSplit(HoodieMergeOnReadTableInputFormat.java:211)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:345)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.getSplits(HoodieMergeOnReadTableInputFormat.java:79)
    at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
    at com.facebook.presto.hive.StoragePartitionLoader.loadPartition(StoragePartitionLoader.java:278)
    at com.facebook.presto.hive.DelegatingPartitionLoader.loadPartition(DelegatingPartitionLoader.java:81)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:224)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$700(BackgroundHiveSplitLoader.java:50)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:153)
    at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
    at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
    at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
    at com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
The reason we see {{java.lang.NoSuchFieldError: LOG}} during the Presto query is that {{HoodieCopyOnWriteTableInputFormat}} inherits the {{LOG}} field from its parent class {{FileInputFormat}}, which is a Hadoop class. At compile time, the bytecode therefore references this field through {{FileInputFormat.class}}. At runtime, however, Presto does not have all the Hadoop classes on its classpath; it uses its own Hadoop dependency, e.g. {{hadoop-apache2:jar}}. I checked that {{hadoop-apache2}} does not have the {{FileInputFormat}} class shaded, which causes this runtime error. -- This message was sent by Atlassian Jira (v8.20.10#820010)
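The fix pattern the issue title implies (giving the class its own logger) can be sketched in a few lines. This is an illustrative stand-in, not Hudi's actual code: the class names below are invented to show why declaring the field locally avoids the compile-time reference into the parent class.

```java
import java.util.logging.Logger;

// Stand-in for Hadoop's FileInputFormat, which declares a LOG field that
// subclasses may inherit. If this class is shaded away at runtime, any
// compiled reference to its LOG field fails with NoSuchFieldError.
class FileInputFormatStandIn {
    protected static final Logger LOG = Logger.getLogger("parent");
}

// The fix pattern: declare LOG on the subclass itself. The compiled field
// reference then targets this class, which is always present regardless of
// how the parent jar is shaded on the query engine's classpath.
class CopyOnWriteInputFormatStandIn extends FileInputFormatStandIn {
    private static final Logger LOG =
        Logger.getLogger(CopyOnWriteInputFormatStandIn.class.getName());

    static String makeSplit() {
        LOG.fine("creating external file split");
        return "external-split";
    }
}
```

The subclass field intentionally hides the inherited one, so no bytecode in the subclass points at the parent's field anymore.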
[jira] [Created] (HUDI-4331) Allow loading external config file from class loader
Wenning Ding created an issue: Apache Hudi / HUDI-4331 Allow loading external config file from class loader Issue Type: Improvement Assignee: Unassigned Created: 27/Jun/22 22:57 Priority: Major Reporter: Wenning Ding
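A minimal sketch of what "loading from the class loader" could look like, under stated assumptions: the file name {{hudi-defaults.conf}} and the properties-style format are illustrative, not necessarily what Hudi actually uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical loader: fall back to the context class loader when the
// external config file is bundled on the classpath (e.g. inside an
// application jar) rather than sitting on the local filesystem.
class ClassLoaderConfigLoader {
    static final String CONFIG_FILE_NAME = "hudi-defaults.conf"; // assumption

    static Properties loadFromClassLoader() {
        Properties props = new Properties();
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        try (InputStream in = cl.getResourceAsStream(CONFIG_FILE_NAME)) {
            if (in != null) {
                props.load(in); // java.util.Properties-style parsing
            }
        } catch (IOException e) {
            // A missing or unreadable external config degrades to empty props.
        }
        return props;
    }
}
```

When the resource is absent, the loader simply returns an empty set of properties instead of failing, which matches the "optional global config" behavior described in the related issues.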
[jira] [Updated] (HUDI-4278) Add skip archive option when syncing to AWS Glue tables
[ https://issues.apache.org/jira/browse/HUDI-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-4278: --- Description: The issue is that each time Hudi upserts records, it syncs to the catalog and updates {{last_commit_time_sync}} on the Glue table. Each time this property is updated, Glue by default creates a new table version and archives the old versions. So if customers update the Hudi table frequently, they will eventually hit the Glue table version limit. Inside Hudi, we therefore pass a parameter {{skipGlueArchive}} through the environment context down to the AWS Glue metadata service, so the Glue client has an option to decide whether to skip archiving.
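The "environment context" mechanism described above can be sketched as a string-property bag attached to the sync call. The stand-in class below is illustrative (not Hive's real {{EnvironmentContext}}); the property name {{skipGlueArchive}} is taken from this issue, and whether the AWS Glue metadata service honors it is up to the service side.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for an environment context: a bag of string properties
// carried through the metastore alter-table call chain.
class EnvironmentContextStandIn {
    final Map<String, String> properties = new HashMap<>();
}

class GlueSyncSketch {
    // Attach the skip-archive hint so the downstream Glue client can decide
    // whether to create an archived table version on this sync.
    static EnvironmentContextStandIn contextWithSkipArchive(boolean skip) {
        EnvironmentContextStandIn ctx = new EnvironmentContextStandIn();
        ctx.properties.put("skipGlueArchive", String.valueOf(skip));
        return ctx;
    }
}
```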
[jira] [Created] (HUDI-4278) Add skip archive option when syncing to AWS Glue tables
Wenning Ding created HUDI-4278: -- Summary: Add skip archive option when syncing to AWS Glue tables Key: HUDI-4278 URL: https://issues.apache.org/jira/browse/HUDI-4278 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding
[jira] [Created] (HUDI-3764) Allow loading external configs while querying Hudi tables with Spark
Wenning Ding created HUDI-3764: -- Summary: Allow loading external configs while querying Hudi tables with Spark Key: HUDI-3764 URL: https://issues.apache.org/jira/browse/HUDI-3764 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Currently, Hudi queries with Spark do not load the external configurations. Say customers enabled metadata listing in their global config file; they would then actually be querying without the metadata feature enabled.
[jira] [Created] (HUDI-3450) Avoid passing empty string spark master to hudi cli
Wenning Ding created HUDI-3450: -- Summary: Avoid passing empty string spark master to hudi cli Key: HUDI-3450 URL: https://issues.apache.org/jira/browse/HUDI-3450 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When using Hudi CLI, if SparkMaster is not passed, Hudi CLI should by default use [SparkUtil.DEFAULT_SPARK_MASTER|https://github.com/apache/hudi/blob/release-0.10.0/hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java#L44]. However, with a recent [code change|https://github.com/apache/hudi/commit/445208a0d20b457daeeb5f70995302c92dd19f31] in OSS, when SparkMaster is not passed, the Spark master is set to {{""}}, which causes the following exception when initializing a Hudi CLI job:
{code:java}
org.apache.spark.SparkException: Could not parse Master URL: ''
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2999)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:115)
    at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:110)
    at org.apache.hudi.cli.commands.SparkMain.main(SparkMain.java:88)
{code}
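The fix amounts to treating a null or blank master string as "not provided" and falling back to the default before constructing the SparkContext. A minimal sketch, with the caveat that the default value below is an assumption (the real one lives in {{SparkUtil.DEFAULT_SPARK_MASTER}}):

```java
// Sketch: never hand "" to SparkContext (it throws
// "Could not parse Master URL: ''"); resolve blanks to the default first.
class SparkMasterResolver {
    static final String DEFAULT_SPARK_MASTER = "yarn"; // assumed default

    static String resolve(String master) {
        return (master == null || master.trim().isEmpty())
            ? DEFAULT_SPARK_MASTER
            : master;
    }
}
```

Resolving before the SparkContext is built keeps the validation in one place instead of scattering empty-string checks through the CLI commands.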
[jira] [Updated] (HUDI-3395) Allow passing rollbackUsingMarkers to Hudi CLI rollback command
[ https://issues.apache.org/jira/browse/HUDI-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-3395: --- Description: In a recent OSS [change|https://github.com/apache/hudi/commit/ce7d2333078e4e1f16de1bce6d448c5eef1e4111], {{hoodie.rollback.using.markers}} is enabled by default. However, if you try to roll back a completed instant, Hudi throws this exception: {{Caused by: java.lang.IllegalArgumentException: Cannot use marker based rollback strategy on completed instant:[20211228041514616__commit__COMPLETED] at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)}} The problem is that our integration test {{testHudiCliRollback}} tries to roll back a completed instant. But right now, the Hudi CLI {{commit rollback}} command does not give customers any way to disable {{hoodie.rollback.using.markers}}, which means customers can never use {{commit rollback}} to roll back a completed commit. So in this CR, I added a parameter called {{usingMarkers}} so that customers can disable this feature.
[jira] [Created] (HUDI-3395) Allow passing rollbackUsingMarkers to Hudi CLI rollback command
Wenning Ding created HUDI-3395: -- Summary: Allow passing rollbackUsingMarkers to Hudi CLI rollback command Key: HUDI-3395 URL: https://issues.apache.org/jira/browse/HUDI-3395 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding In a recent OSS [change|https://github.com/apache/hudi/commit/ce7d2333078e4e1f16de1bce6d448c5eef1e4111], {{hoodie.rollback.using.markers}} is enabled by default. However, if you try to roll back a completed instant, Hudi throws this exception: {{Caused by: java.lang.IllegalArgumentException: Cannot use marker based rollback strategy on completed instant:[20211228041514616__commit__COMPLETED] at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)}} The problem is that our integration test {{testHudiCliRollback}} tries to roll back a completed instant. But right now, the Hudi CLI {{commit rollback}} command does not give customers any way to disable {{hoodie.rollback.using.markers}}, which means customers can never use {{commit rollback}} to roll back a completed commit. So in this CR, I added a parameter called {{usingMarkers}} so that customers can disable this feature.
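The guard behind that exception can be sketched as follows. The class and method names here are illustrative, not Hudi's actual code: the point is that marker-based rollback only applies to inflight instants, so the CLI flag ({{usingMarkers}} in this issue) must be able to select the listing-based strategy for completed commits.

```java
// Sketch of strategy selection for rollback: a completed instant has no
// marker files left, so marker-based rollback must be rejected for it.
class RollbackStrategyChooser {
    static String choose(boolean usingMarkers, boolean instantCompleted) {
        if (usingMarkers && instantCompleted) {
            // Mirrors the ValidationUtils.checkArgument failure in the issue.
            throw new IllegalArgumentException(
                "Cannot use marker based rollback strategy on completed instant");
        }
        return usingMarkers ? "MARKER_BASED" : "LISTING_BASED";
    }
}
```

With the CLI flag exposed, a user rolling back a completed commit can pass {{usingMarkers=false}} and take the listing-based path instead of hitting the exception.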
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466279#comment-17466279 ] Wenning Ding commented on HUDI-3122: Thanks, I will give it a shot.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466275#comment-17466275 ] Wenning Ding commented on HUDI-3122: I see your point, but [https://github.com/prestodb/presto/blob/release-0.261/presto-hive/pom.xml] is just a dependency declaration, right? So the Presto classpath still does not have the hbase-related jars. We need hudi-presto-bundle.jar because at runtime, say when you query a Hudi bootstrap table, it requires {{hbase/io/hfile/CacheConfig}}, which is a class in hbase-server, but by default Presto does not have any hbase jar on its classpath.
[jira] [Comment Edited] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466269#comment-17466269 ] Wenning Ding edited comment on HUDI-3122 at 12/29/21, 12:30 AM: I checked hudi-presto-bundle.jar; the hbase-server related class is not there. That is why I saw the ClassNotFound exception. was (Author: wenningd): I checked the hudi-presto-bundle.jar, the hbase server related class is not there.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466269#comment-17466269 ] Wenning Ding commented on HUDI-3122: I checked hudi-presto-bundle.jar; the hbase-server related class is not there.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466266#comment-17466266 ] Wenning Ding commented on HUDI-3122: Presto 0.261
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466261#comment-17466261 ] Wenning Ding commented on HUDI-3122: [~zhangyue19921010] Is removing hbase-server from hudi-presto-bundle.jar intended?
[jira] [Created] (HUDI-3122) presto query failed for bootstrap tables
Wenning Ding created HUDI-3122: -- Summary: presto query failed for bootstrap tables Key: HUDI-3122 URL: https://issues.apache.org/jira/browse/HUDI-3122 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding
{code:java}
java.lang.NoClassDefFoundError: org/apache/hudi/org/apache/hadoop/hbase/io/hfile/CacheConfig
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.<init>(HFileBootstrapIndex.java:243)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:137)
    at java.util.HashMap.forEach(HashMap.java:1290)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:134)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:294)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:281)
{code}
[jira] [Created] (HUDI-2946) Upgrade maven plugins to make Hudi compatible with higher Java versions
Wenning Ding created HUDI-2946: -- Summary: Upgrade maven plugins to make Hudi compatible with higher Java versions Key: HUDI-2946 URL: https://issues.apache.org/jira/browse/HUDI-2946 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding I saw several issues while building Hudi with Java 11:
{code:java}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar (default) on project hudi-common: Execution default of goal org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar: java.lang.ExceptionInInitializerError: null
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:3.1.1:shade (default) on project hudi-hadoop-mr-bundle: Error creating shaded jar: Problem shading JAR /workspace/workspace/rchertar.bigtop.hudi-rpm-mainline-6.x-0.9.0/build/hudi/rpm/BUILD/hudi-0.9.0-amzn-1-SNAPSHOT/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.9.0-amzn-1-SNAPSHOT.jar entry org/apache/hudi/hadoop/bundle/Main.class: java.lang.IllegalArgumentException -> [Help 1]
{code}
We need to upgrade the maven plugin versions to make Hudi compatible with Java 11. Also upgrade dockerfile-maven-plugin to the latest version to support Java 11: [https://github.com/spotify/dockerfile-maven/pull/230]
[jira] [Created] (HUDI-2884) Allow loading external configs while querying Hudi tables with Spark
Wenning Ding created HUDI-2884: -- Summary: Allow loading external configs while querying Hudi tables with Spark Key: HUDI-2884 URL: https://issues.apache.org/jira/browse/HUDI-2884 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Currently, Hudi queries with Spark do not load the external configurations. Say customers enabled metadata listing in their global config file; they would then actually be querying without the metadata feature enabled. This CR fixes this issue and allows loading global configs during the Hudi reading phase.
[jira] [Created] (HUDI-2775) Add documentation for external configuration support
Wenning Ding created HUDI-2775: -- Summary: Add documentation for external configuration support Key: HUDI-2775 URL: https://issues.apache.org/jira/browse/HUDI-2775 Project: Apache Hudi Issue Type: Sub-task Reporter: Wenning Ding We added external configuration support in this PR: [https://github.com/apache/hudi/pull/3486], but we need more documentation to let customers know how to use it.
[jira] [Updated] (HUDI-2364) Run compaction without user schema file provided
[ https://issues.apache.org/jira/browse/HUDI-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-2364: --- Description: Currently to run Hudi compaction manually, customers have to pass the avsc file of data schema by themselves, e.g. in Hudi CLI, {{}} {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://xxx/hudi/mor_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Let customers provide avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and get other exceptions. We should handle this logic inside Hudi if possible. was: Currently to run Hudi compaction manually, customers have to pass the avsc file of data schema by themselves, e.g. in Hudi CLI, {{}} {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://wenningd-emr-dev/oncall/hudi/mor_delete_2_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Let customers provide avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and get other exceptions. We should handle this logic inside Hudi if possible. > Run compaction without user schema file provided > > > Key: HUDI-2364 > URL: https://issues.apache.org/jira/browse/HUDI-2364 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Wenning Ding >Priority: Major > > Currently to run Hudi compaction manually, customers have to pass the avsc > file of data schema by themselves, > e.g. 
in Hudi CLI: > > {code:java} > compaction run --compactionInstant 20201203005420 \ --parallelism 2 > --sparkMemory 2G \ --schemaFilePath s3://xxx/hudi/mor_schema.avsc \ > --propsFilePath file:///home/hadoop/config.properties --retry 1 > {code} > Letting customers provide the avsc file is not a good option. Some customers don’t > know how to generate this schema file, and some customers pass the wrong > schema file and hit other exceptions. We should handle this logic inside Hudi > if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2364) Run compaction without user schema file provided
Wenning Ding created HUDI-2364: -- Summary: Run compaction without user schema file provided Key: HUDI-2364 URL: https://issues.apache.org/jira/browse/HUDI-2364 Project: Apache Hudi Issue Type: New Feature Reporter: Wenning Ding Currently, to run Hudi compaction manually, customers have to pass the avsc file of the data schema themselves, e.g. in Hudi CLI: {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://wenningd-emr-dev/oncall/hudi/mor_delete_2_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Letting customers provide the avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and hit other exceptions. We should handle this logic inside Hudi if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
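The direction proposed here (resolving the schema inside Hudi instead of requiring --schemaFilePath) can be sketched as below. Hudi records the writer schema in commit metadata, which compaction could fall back to; the class, method, and metadata-key names in this sketch are hypothetical illustrations, not Hudi's actual API.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: prefer an explicitly supplied schema file, otherwise
// fall back to the writer schema recorded in the latest commit's metadata.
public class CompactionSchemaResolver {
    public static String resolveSchema(Optional<String> userSchemaFile,
                                       Map<String, String> latestCommitExtraMetadata) {
        // an explicitly provided --schemaFilePath still wins
        if (userSchemaFile.isPresent()) {
            return userSchemaFile.get();
        }
        // "schema" is an assumed metadata key for illustration
        String recorded = latestCommitExtraMetadata.get("schema");
        if (recorded == null) {
            throw new IllegalStateException(
                "No schema found in commit metadata; pass --schemaFilePath explicitly");
        }
        return recorded;
    }
}
```

With this fallback in place, the CLI invocation shown above would work without the `--schemaFilePath` argument whenever the table has at least one commit.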
[jira] [Created] (HUDI-2362) Hudi external configuration file support
Wenning Ding created HUDI-2362: -- Summary: Hudi external configuration file support Key: HUDI-2362 URL: https://issues.apache.org/jira/browse/HUDI-2362 Project: Apache Hudi Issue Type: New Feature Reporter: Wenning Ding Many big data applications like Hadoop and Hive have an XML configuration file that gives users a central place to set all their parameters. Also, to support Spark SQL, it might be easier for Hudi to have a configuration file, which would avoid setting Hudi parameters inside the Hive CLI or Spark SQL CLI. Here is an example: {code:java} # Enable optimistic concurrency control by default, to disable it, remove the following two configs hoodie.write.concurrency.mode optimistic_concurrency_control hoodie.cleaner.policy.failed.writes LAZY hoodie.write.lock.provider org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider hoodie.write.lock.zookeeper.url ip-192-168-1-239.ec2.internal hoodie.write.lock.zookeeper.port 2181 hoodie.write.lock.zookeeper.base_path hudi_occ_lock hoodie.index.type BLOOM # Only applies if index type is HBASE hoodie.index.hbase.zkquorum ip-192-168-1-239.ec2.internal hoodie.index.hbase.zkport 2181 # Only applies if Hive sync is enabled hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://ip-192-168-1-239.ec2.internal:1 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
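For illustration, parsing the whitespace-separated format shown above is straightforward. This sketch (the class and method names are made up, not part of Hudi) reads `key value` lines while skipping comments and blank lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for the "key value" configuration format above:
// '#' lines are comments, blank lines are ignored, and the first run of
// whitespace separates the key from the value.
public class ExternalConfParser {
    public static Map<String, String> parse(String content) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // comments and blank lines carry no configuration
            }
            // split on the first whitespace run: key, then the rest as value
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length == 2) {
                props.put(parts[0], parts[1]);
            }
        }
        return props;
    }
}
```

Because later `put` calls overwrite earlier ones, a user-level file could simply be parsed after a cluster-level file to get the usual override semantics.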
[jira] [Created] (HUDI-2314) Add DynamoDb based lock provider
Wenning Ding created HUDI-2314: -- Summary: Add DynamoDb based lock provider Key: HUDI-2314 URL: https://issues.apache.org/jira/browse/HUDI-2314 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Similar to the ZooKeeper and Hive metastore based lock providers, we need to add a DynamoDB based lock provider. The benefit of a DynamoDB based lock provider is that customers who use AWS EMR can share lock information across different EMR clusters, and it is easier for them to configure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
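The core idea behind such a provider is that DynamoDB supports conditional writes: acquiring the lock is a conditional put that succeeds only when no item with the lock key exists yet (in DynamoDB terms, a PutItem with an `attribute_not_exists` condition). Below is a minimal in-memory model of that idea; everything here is an illustrative stand-in, not the real provider, with a shared map standing in for the DynamoDB table:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative model of a conditional-write lock: acquisition wins only if
// no entry for the lock key exists, mirroring DynamoDB's conditional PutItem.
public class ConditionalWriteLock {
    private final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();

    // returns true iff the caller won the conditional write
    public boolean tryAcquire(String lockKey, String ownerId) {
        return table.putIfAbsent(lockKey, ownerId) == null;
    }

    public void release(String lockKey, String ownerId) {
        // conditional delete: only the current owner may remove the lock item
        table.remove(lockKey, ownerId);
    }
}
```

A real implementation would also need a TTL or lease expiry on the item so that a crashed writer cannot hold the lock forever.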
[jira] [Created] (HUDI-2255) Refactor DataSourceOptions
Wenning Ding created HUDI-2255: -- Summary: Refactor DataSourceOptions Key: HUDI-2255 URL: https://issues.apache.org/jira/browse/HUDI-2255 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding As discussed with Vinoth, we can rename the DataSourceOptions constants from xxx_OPT_KEY to xxx_OPT. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2242) Add inference logic to few Hudi configs
[ https://issues.apache.org/jira/browse/HUDI-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-2242: -- Assignee: Wenning Ding > Add inference logic to few Hudi configs > --- > > Key: HUDI-2242 > URL: https://issues.apache.org/jira/browse/HUDI-2242 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > Since we already refactored the Hudi config framework, we can add inference > logic to a few Hudi configs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2242) Add inference logic to few Hudi configs
Wenning Ding created HUDI-2242: -- Summary: Add inference logic to few Hudi configs Key: HUDI-2242 URL: https://issues.apache.org/jira/browse/HUDI-2242 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Since we already refactored the Hudi config framework, we can add inference logic to a few Hudi configs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
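As a sketch of what "inference logic" could mean here: when a config is unset, derive its value from other settings before falling back to the static default. All class and method names below are hypothetical illustrations, not Hudi's actual config API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical config option with an inference function: an explicit setting
// wins, otherwise the value is inferred from the rest of the configuration,
// otherwise the static default applies.
public class InferredOption {
    private final String key;
    private final String defaultValue;
    private final Function<Map<String, String>, Optional<String>> inferFunc;

    public InferredOption(String key, String defaultValue,
                          Function<Map<String, String>, Optional<String>> inferFunc) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.inferFunc = inferFunc;
    }

    public String resolve(Map<String, String> conf) {
        if (conf.containsKey(key)) {
            return conf.get(key); // explicit setting always wins
        }
        return inferFunc.apply(conf).orElse(defaultValue); // else infer, else default
    }
}
```

For example (an assumed inference rule, purely for illustration), the Hive partition extractor class could be inferred to be a non-partitioned extractor whenever the partition-path field is set to an empty string.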
[jira] [Comment Edited] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377525#comment-17377525 ] Wenning Ding edited comment on HUDI-2146 at 7/8/21, 5:37 PM: - [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output (this is the content of parquet file at 20210706171250): {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 
20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output I saw (this is the content of parquet file at 20210706171252): {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 {code} I started two spark-shell in the same cluster. was (Author: wenningd): [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. 
insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output
[jira] [Commented] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377525#comment-17377525 ] Wenning Ding commented on HUDI-2146: [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 
20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output I saw: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 {code} I started two spark-shell in the same cluster. > Concurrent writes loss data > > > Key: HUDI-2146 > URL: https://issues.apache.org/jira/browse/HUDI-2146 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Priority: Blocker > Fix For: 0.9.0 > > Attachments: image-2021-07-08-00-49-30-730.png > > > Reproduction steps: > Create a Hudi table: > {code:java} > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.spark.sql.SaveMode > import org.apache.hudi.AvroConversionUtils > val df = Seq( > (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), > (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), > (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), > (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2") > ).toDF("event_id", "event_name", "event_ts", "event_type") > var tableName = "hudi_test" > var tablePath = "s3://.../" + tableName > // write hudi dataset > df.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > 
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Overwrite) > .save(tablePath) > {code} > Perform two insert operations almost in the same time, each insertion > c
[jira] [Created] (HUDI-2146) Concurrent writes loss data
Wenning Ding created HUDI-2146: -- Summary: Concurrent writes loss data Key: HUDI-2146 URL: https://issues.apache.org/jira/browse/HUDI-2146 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Attachments: image-2021-07-08-00-49-30-730.png Reproduction steps: Create a Hudi table: {code:java} import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.config.HoodieWriteConfig import org.apache.spark.sql.SaveMode import org.apache.hudi.AvroConversionUtils val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") var tableName = "hudi_test" var tablePath = "s3://.../" + tableName // write hudi dataset df.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .mode(SaveMode.Overwrite) .save(tablePath) {code} Perform two insert operations almost in the same time, each insertion contains different data: Insert 1: {code:java} val df3 = Seq( (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", 
"2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") // update hudi dataset df3.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control") .option("hoodie.cleaner.policy.failed.writes", "LAZY") .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider") .option("hoodie.write.lock.zookeeper.url", "ip-***.ec2.internal") .option("hoodie.write.lock.zookeeper.port", "2181") .option("hoodie.write.lock.zookeeper.lock_key", tableName) .option("hoodie.write.lock.zookeeper.base_path", "/occ_lock") .mode(SaveMode.Append) .save(tablePath) {code} Insert 2: {code:java} val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", 
"event_name", "event_ts", "event_type") // update hudi dataset df3.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOpti
[jira] [Created] (HUDI-1899) Auto sync Hudi configuration description with website
Wenning Ding created HUDI-1899: -- Summary: Auto sync Hudi configuration description with website Key: HUDI-1899 URL: https://issues.apache.org/jira/browse/HUDI-1899 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Based on the change in https://issues.apache.org/jira/browse/HUDI-89, we can further add a feature to auto-sync Hudi configuration descriptions with the website. Flink has a similar implementation: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/description/Description.java. -- This message was sent by Atlassian Jira (v8.3.4#803005)
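A rough sketch of the Flink-style idea: build config descriptions from structured blocks so that the same source text can be rendered both as plain text/markdown for code docs and as HTML for the website. The class below is purely illustrative, not Flink's or Hudi's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative structured description builder: text and code fragments are
// collected as blocks and rendered per output target, so docs never need to
// be hand-copied to the website.
public class ConfigDescription {
    private final List<String> blocks = new ArrayList<>();

    public ConfigDescription text(String t) { blocks.add(t); return this; }
    public ConfigDescription code(String c) { blocks.add("`" + c + "`"); return this; }

    public String toMarkdown() {
        return String.join(" ", blocks);
    }

    public String toHtml() {
        // the website renderer swaps backtick spans for <code> tags
        return toMarkdown().replaceAll("`([^`]*)`", "<code>$1</code>");
    }
}
```

A site generator could then walk all config options at build time and emit the HTML rendering, keeping the website in sync with the code automatically.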
[jira] [Comment Edited] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding edited comment on HUDI-874 at 5/11/21, 6:52 PM: - It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. was (Author: wenningd): It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding edited comment on HUDI-874 at 5/11/21, 6:52 PM: - It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. was (Author: wenningd): Can you share some reproduction steps? Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding commented on HUDI-874: --- Can you share some reproduction steps? Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1627) DeltaStreamer does not work with SQLtransformation
Wenning Ding created HUDI-1627: -- Summary: DeltaStreamer does not work with SQLtransformation Key: HUDI-1627 URL: https://issues.apache.org/jira/browse/HUDI-1627 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Running Hudi DeltaStreamer with a SQL transformation fails: 21/02/04 17:19:50 INFO SqlQueryBasedTransformer: SQL Query for transformation : (select order_date from `default`.ordertest) org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`ordertest`; line 1 pos 23; 'Project ['order_date] +- 'UnresolvedRelation `default`.`ordertest` This is because DeltaStreamer does not enable Hive support when initiating the SparkSession, so tables in the Hive catalog cannot be resolved: this.sparkSession = SparkSession.builder().config(jssc.getConf()).getOrCreate(); line 526 in [https://github.com/a0x8o/hudi/blob/b8d0747959bc6f101b5b90b8e3ad323aafa2aa6e/hudi-u[…]rg/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java|https://github.com/a0x8o/hudi/blob/b8d0747959bc6f101b5b90b8e3ad323aafa2aa6e/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1240) Simplify config classes
[ https://issues.apache.org/jira/browse/HUDI-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282006#comment-17282006 ] Wenning Ding commented on HUDI-1240: Makes sense to me! This is similar to what I thought. Just one question: if we start using the ConfigOption structure, do we still keep the existing configuration files? If we remove the existing configuration files, then users won't be able to use something like this any more? {code:java} .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id"){code} > Simplify config classes > --- > > Key: HUDI-1240 > URL: https://issues.apache.org/jira/browse/HUDI-1240 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup >Reporter: sivabalan narayanan >Priority: Major > > Clean up config classes across the board with a {{HoodieConfig}} class in > hudi-common that wraps a single key, value, default, doc, and fallback keys (the old > keys this config points to). > The notion of fallback keys is very important, so that the user can still > use the old config names and we can map them to how we have now > renamed them. > i.e. say there was `hoodie.a.b.c` and we think `hoodie.x.y.z` is a better > name, then `hoodie.a.b.c` should be marked as deprecated (annotation or a > boolean in HoodieConfig) and list `hoodie.x.y.z` as the fallback key. Users > should be able to set `hoodie.a.b.c` for the next 3-6 months at least, and > internally we translate it to `hoodie.x.y.z`. None of our code should directly > access `hoodie.a.b.c` anymore. > We can see the Apache Flink project for examples. > Once this is done, in a first pass we should move all our existing configs to > something like below. 
> {code:java} > public static final String EMBEDDED_TIMELINE_SERVER_ENABLED = > "hoodie.embed.timeline.server"; > public static final String DEFAULT_EMBEDDED_TIMELINE_SERVER_ENABLED = > "true"; > {code} > becomes > {code:java} >public static HoodieConfig timelineServerEnabled = new HoodieConfig( > "hoodie.embed.timeline.server", > // property name > Boolean.class, // type > true, //default val > Option.empty(), // fallback key > false, //deprecated > > "Enables/Disables the timeline > server on the write client.." //doc > ) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
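The fallback-key behavior described above can be sketched as a lookup that tries the new key first, then the deprecated keys, then the default. This is an illustration of the proposal, not Hudi's actual implementation; the class and method names are made up:

```java
import java.util.Properties;

// Illustrative fallback-key lookup: read the new key first, then any
// deprecated (fallback) keys, then the default, so a config renamed from
// `hoodie.a.b.c` to `hoodie.x.y.z` keeps honoring the old name.
public class FallbackConfig {
    public static String get(Properties props, String key, String defaultValue,
                             String... fallbackKeys) {
        if (props.containsKey(key)) {
            return props.getProperty(key);
        }
        for (String oldKey : fallbackKeys) {
            if (props.containsKey(oldKey)) {
                return props.getProperty(oldKey); // deprecated name still honored
            }
        }
        return defaultValue;
    }
}
```

Callers would always ask for the new key; only this lookup layer ever sees the deprecated names, which matches the note above that no other code should access the old key directly.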
[jira] [Commented] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig
[ https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280356#comment-17280356 ] Wenning Ding commented on HUDI-89: -- Just one question [~shivnarayan] [~vinoth]: if we start using the ConfigOption structure, then users won't be able to use something like this any more? {code:java} .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id") {code} > Clean up placement, naming, defaults of HoodieWriteConfig > - > > Key: HUDI-89 > URL: https://issues.apache.org/jira/browse/HUDI-89 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup, Usability, Writer Core >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > # Rename HoodieWriteConfig to HoodieClientConfig > # Move a bunch of configs from CompactionConfig to StorageConfig > # Introduce a new HoodieCleanConfig > # Should we consider lombok or something to automate the > defaults/getters/setters? > # Consistent naming of properties/defaults > # Enforce bounds more strictly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1512: --- Description: This bug was introduced by https://github.com/apache/hudi/pull/2328. hudi-spark2 unit tests failed when running with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} This threw a class-not-found error: {code:java} java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code} When the spark3 profile is enabled, some of the hudi-spark2 test code is compiled against Spark 3 and then run with Spark 2; the above error is caused by this incompatibility. Solving this compatibility issue properly is hard, but we can skip hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment. was: hudi-spark2 unit tests failed when running with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} This threw a class-not-found error: {code:java} java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code} When the spark3 profile is enabled, some of the hudi-spark2 test code is compiled against Spark 3 and run with Spark 2; the above error is caused by this incompatibility. Solving this compatibility issue properly is hard, but we can skip hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment. 
> Fix hudi-spark2 unit tests failure with Spark 3.0.0 > > > Key: HUDI-1512 > URL: https://issues.apache.org/jira/browse/HUDI-1512 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > This bug was introduced by https://github.com/apache/hudi/pull/2328. > > hudi-spark2 unit tests failed when running with Spark 3.0.0: > {code:java} > mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} > This threw a class-not-found error: > {code:java} > java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder > at > org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) > {code} > When the spark3 profile is enabled, some of the hudi-spark2 test code is > compiled against Spark 3 and then run with Spark 2; the above error is caused > by this incompatibility. > Solving this compatibility issue properly is hard, but we can skip > hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 > code is not actually used in a Spark 3.0.0 environment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1512: --- Description:
hudi-spark2 unit tests fail when running with Spark 3.0.0:
{code:java}
mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
{code}
The build throws a class-not-found error:
{code:java}
java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
  at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
{code}
When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.

was: hudi-spark2 unit tests failed with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2 java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code}

> Fix hudi-spark2 unit tests failure with Spark 3.0.0
>
> Key: HUDI-1512
> URL: https://issues.apache.org/jira/browse/HUDI-1512
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
>
> hudi-spark2 unit tests fail when running with Spark 3.0.0:
> {code:java}
> mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
> {code}
> The build throws a class-not-found error:
> {code:java}
> java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
> at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
> {code}
> When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
> Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.
[jira] [Assigned] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1512: -- Assignee: Wenning Ding
> Fix hudi-spark2 unit tests failure with Spark 3.0.0
>
> Key: HUDI-1512
> URL: https://issues.apache.org/jira/browse/HUDI-1512
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wenning Ding
> Assignee: Wenning Ding
> Priority: Major
>
> hudi-spark2 unit tests fail when running with Spark 3.0.0:
> {code:java}
> mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
> {code}
> The build throws a class-not-found error:
> {code:java}
> java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
> at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
> {code}
> When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
> Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.
[jira] [Created] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
Wenning Ding created HUDI-1512: -- Summary: Fix hudi-spark2 unit tests failure with Spark 3.0.0 Key: HUDI-1512 URL: https://issues.apache.org/jira/browse/HUDI-1512 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding
hudi-spark2 unit tests failed with Spark 3.0.0:
{code:java}
mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2

java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
  at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
{code}
[jira] [Updated] (HUDI-1489) Not able to read after updating bootstrap table with written table
[ https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1489: --- Description:
After updating a Hudi table that was written via bootstrap, reading the latest version of the bootstrap table fails.
h3. Reproduction steps
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieBootstrapConfig
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

val bucketName = "wenningd-dev"
val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
val recordKeyName = "event_id"
val partitionKeyName = "event_type"
val precombineKeyName = "event_time"
val verificationRecordKey = "4"
val verificationColumn = "event_name"
val originalVerificationValue = "event_d"
val updatedVerificationValue = "event_test"
val sourceTableLocation = "s3://wenningd-dev/hudi/test-data/source_table/"
val tableType = HoodieTableType.COPY_ON_WRITE.name()
val verificationSqlQuery = "select " + verificationColumn + " from " + tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
val loadTablePath = tablePath + "/*/*"

// Create table and sync with hive
val df = spark.emptyDataFrame
val tableType = HoodieTableType.COPY_ON_WRITE.name
df.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, sourceTableLocation)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// Verify create with spark sql query
val result0 = spark.sql(verificationSqlQuery)
if (!(result0.count == 1) || !result0.collect.mkString.contains(originalVerificationValue)) {
  throw new TestFailureException("Create table verification failed!")
}

val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
df5.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKeyName)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)

val result1 = spark.sql(verificationSqlQuery)
val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
df6.show
{code}
df6.show would return:
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGSchedul
[jira] [Created] (HUDI-1489) Not able to read after updating bootstrap table with written table
Wenning Ding created HUDI-1489: -- Summary: Not able to read after updating bootstrap table with written table Key: HUDI-1489 URL: https://issues.apache.org/jira/browse/HUDI-1489 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding
After updating a Hudi table that was written via bootstrap, reading the latest version of the bootstrap table fails.
h3. Reproduction steps
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieBootstrapConfig
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

val bucketName = "wenningd-emr-dev"
val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
val recordKeyName = "event_id"
val partitionKeyName = "event_type"
val precombineKeyName = "event_time"
val verificationRecordKey = "4"
val verificationColumn = "event_name"
val originalVerificationValue = "event_d"
val updatedVerificationValue = "event_test"
// val sourceTableWithoutHiveStylePartition = "s3://wenningd-emr-dev/hudi/test-data/source_table/"
// new parameters
val sourceTableLocation = "s3://wenningd-emr-dev/hudi/test-data/source_table/"
val tableType = HoodieTableType.COPY_ON_WRITE.name()
val verificationSqlQuery = "select " + verificationColumn + " from " + tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
val loadTablePath = tablePath + "/*/*"

// Create table and sync with hive
val df = spark.emptyDataFrame
val tableType = HoodieTableType.COPY_ON_WRITE.name
// val tableType = HoodieTableType.MERGE_ON_READ.name
df.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, sourceTableLocation)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// Verify create with spark sql query
val result0 = spark.sql(verificationSqlQuery)
if (!(result0.count == 1) || !result0.collect.mkString.contains(originalVerificationValue)) {
  throw new TestFailureException("Create table verification failed!")
}

val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
df5.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKeyName)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
  // .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)

val result1 = spark.sql(verificationSqlQuery)
val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
df6.show
{code}
df6.show would return:
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayB
[jira] [Assigned] (HUDI-1489) Not able to read after updating bootstrap table with written table
[ https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1489: -- Assignee: Wenning Ding > Not able to read after updating bootstrap table with written table > -- > > Key: HUDI-1489 > URL: https://issues.apache.org/jira/browse/HUDI-1489 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > After updating Hudi table with the written bootstrap table, it would fail to > read the latest bootstrap table. > h3. Reproduction steps > {code:java} > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.common.model.HoodieTableType > import org.apache.hudi.config.HoodieBootstrapConfig > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.spark.sql.SaveMode > import org.apache.spark.sql.SparkSession > val bucketName = "wenningd-emr-dev" > val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8" > val recordKeyName = "event_id" > val partitionKeyName = "event_type" > val precombineKeyName = "event_time" > val verificationRecordKey = "4" > val verificationColumn = "event_name" > val originalVerificationValue = "event_d" > val updatedVerificationValue = "event_test" > // val sourceTableWithoutHiveStylePartition = > "s3://wenningd-emr-dev/hudi/test-data/source_table/" > // new parameters > val sourceTableLocation = > "s3://wenningd-emr-dev/hudi/test-data/source_table/" > val tableType = HoodieTableType.COPY_ON_WRITE.name() > val verificationSqlQuery = "select " + verificationColumn + " from " + > tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'" > val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName > val loadTablePath = tablePath + "/*/*" > // Create table and sync with hive > val df = spark.emptyDataFrame > val tableType = HoodieTableType.COPY_ON_WRITE.name > // val tableType = HoodieTableType.MERGE_ON_READ.name > df.write > .format("hudi") > 
.option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, > sourceTableLocation) > .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, > "org.apache.hudi.keygen.SimpleKeyGenerator") > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, > recordKeyName) > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > partitionKeyName) > > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Overwrite) > .save(tablePath) > // Verify create with spark sql query > val result0 = spark.sql(verificationSqlQuery) > if (!(result0.count == 1) || > !result0.collect.mkString.contains(originalVerificationValue)) { > throw new TestFailureException("Create table verification failed!") > } > val df3 = spark.read.format("org.apache.hudi").load(loadTablePath) > val df4 = df3.filter(col(recordKeyName) === verificationRecordKey) > val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue)) > df5.write.format("org.apache.hudi") > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName) > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, > partitionKeyName) > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName) > // .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > 
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > partitionKeyName) > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Append) > .save(tablePath) > val result1 = spark.sql(verificationSqlQuery) > val df6 = spark.read.format("org.apache.hudi").load(loadTablePath) > df6.show > {code} > df6.show would return: > {code:java} > Driver stacktrace: > at > org.
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, suppose we set BulkInsertShuffleParallelism to 2 and have the following df as input:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly. Repartitioning and then sorting within each partition might be a way, though sorting within each partition could cause OOM issues if the data is unbalanced.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "t
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly. Repartitioning and then sorting within each partition might be a way.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"), (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"), (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"), (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type
[jira] [Created] (HUDI-1461) Bulk insert v2 creates additional small files
Wenning Ding created HUDI-1461: -- Summary: Bulk insert v2 creates additional small files Key: HUDI-1461 URL: https://issues.apache.org/jira/browse/HUDI-1461 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance. The current logic first sorts the input dataframe and then does a coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.) Based on the current logic, after sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see the coalesce result is actually unrelated to the sort order, and each Spark partition id contains 3 types of Hudi partitions. So during the writing phase, each Spark executor gets its corresponding partition id and creates 3 files under 3 Hudi partitions. In the end we have two parquet files under each Hudi partition, but with such a small dataset we should ideally have a single file under each Hudi partition. If I change the sort to repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as far as we understand, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to correctly handle both requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
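The layout difference above can be reproduced without Spark. The sketch below uses hypothetical Python helpers (not Spark APIs): sort+coalesce is simulated as contiguous chunking of the sorted rows, repartition as hash partitioning by the Hudi partition column, and each (Spark partition, Hudi partition) pair is counted as one data file.

```python
import zlib

# Hypothetical stand-ins for Spark's operators, simulating why
# sort+coalesce spreads a Hudi partition over several Spark partitions
# while repartitioning by the partition column does not.

def sort_then_coalesce(rows, sort_key, n):
    """Globally sort, then cut into n contiguous chunks (like coalesce)."""
    s = sorted(rows, key=sort_key)
    size = -(-len(s) // n)  # ceiling division
    return [s[i * size:(i + 1) * size] for i in range(n)]

def repartition_by_key(rows, part_key, n):
    """Hash-partition: every row with the same key lands in one partition."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[zlib.crc32(part_key(r).encode()) % n].append(r)
    return parts

def files_written(partitions, hudi_key):
    """Each (Spark partition, Hudi partition) pair becomes one data file."""
    return sum(len({hudi_key(r) for r in part}) for part in partitions)

rows = [(100, "type1"), (101, "type2"), (104, "type1"), (108, "type1"),
        (109, "type3"), (110, "type3"), (105, "type2")]
event_type = lambda r: r[1]

# Chunk boundaries ignore event_type, so the chunks mix types: 4 files here.
print(files_written(sort_then_coalesce(rows, lambda r: (r[1], r[0]), 2), event_type))  # 4
# Each type stays in exactly one partition: 3 files, one per Hudi partition.
print(files_written(repartition_by_key(rows, event_type, 2), event_type))  # 3
```

The simulation is simplified (Spark's coalesce merges upstream partitions rather than cutting even chunks, which is why the actual run above writes 6 files), but it shows the same effect: chunking a sorted dataframe cuts across Hudi partition boundaries and produces more files than the ideal one per partition.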
[jira] [Created] (HUDI-1451) Support bulk insert v2 with Spark 3
Wenning Ding created HUDI-1451: -- Summary: Support bulk insert v2 with Spark 3 Key: HUDI-1451 URL: https://issues.apache.org/jira/browse/HUDI-1451 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Recently, we made Hudi support Spark 3.0.0: https://github.com/apache/hudi/pull/2208. However, the Spark datasource V2 API interfaces went through a large refactor from version 2.4.4 to 3.0.0, so our bulk insert v2 feature (https://issues.apache.org/jira/browse/HUDI-1013) is currently marked as unsupported with Spark 3. We need to redesign the bulk insert v2 part against the datasource v2 API in Spark 3.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1432) Better organize classes in hudi-spark-common & hudi-spark
Wenning Ding created HUDI-1432: -- Summary: Better organize classes in hudi-spark-common & hudi-spark Key: HUDI-1432 URL: https://issues.apache.org/jira/browse/HUDI-1432 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding In this PR: https://github.com/apache/hudi/pull/2208, we refactored the hudi-spark module and split it into four submodules: hudi-spark, hudi-spark-common, hudi-spark2 and hudi-spark3. So far we have only moved two classes to hudi-spark-common: DataSourceWriteOptions and DataSourceUtils. Ideally we should move more classes into hudi-spark-common. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1396) Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
Wenning Ding created HUDI-1396: -- Summary: Bootstrap job via Hudi datasource hangs at the end of the spark-submit job Key: HUDI-1396 URL: https://issues.apache.org/jira/browse/HUDI-1396 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Bootstrap job via Hudi datasource hangs at the end of the spark-submit job This issue is similar to https://issues.apache.org/jira/browse/HUDI-1230. Basically, {{HoodieWriteClient}} at [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] will not be closed and as a result, the corresponding timeline server will not stop at the end. Therefore the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
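The hang has the classic shape of a resource that owns a non-daemon background thread (here, the embedded timeline server) not being closed before the driver exits. A minimal language-agnostic sketch of the failure mode and the close-in-finally fix, using illustrative Python classes rather than Hudi code:

```python
import threading
import time

class TimelineServer:
    """Illustrative stand-in for the embedded timeline server: a non-daemon
    background thread that keeps the process alive until explicitly stopped."""
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)  # non-daemon by default
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            time.sleep(0.01)  # pretend to serve timeline requests

    def stop(self):
        self._stop.set()
        self._thread.join()

class WriteClient:
    """Illustrative stand-in for HoodieWriteClient: starting it starts the
    server; forgetting close() leaves the server thread running forever."""
    def __init__(self):
        self.server = TimelineServer()

    def write(self, rows):
        return len(rows)

    def close(self):
        self.server.stop()

# The fix pattern: close in finally, so the server stops even if the
# write fails and the process can actually exit.
client = WriteClient()
try:
    written = client.write([1, 2, 3])
finally:
    client.close()
print(written)  # 3
```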
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning of Spark datasource. > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since metadata columns are not used during the Spark datasource writing > process, we can drop those columns in the beginning. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning of Spark datasource. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns is not used during the Spark datasource processing, we can drop those columns in the beginning of Spark datasource. > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. 
> Since metadata columns are not used during the Spark datasource writing > process, we can drop those columns in the beginning of Spark datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop metadata columns before Spark datasource processing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns is not used during the Spark datasource processing, we can drop those columns in the beginning of Spark datasource. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since the schema of metadata columns is always the same, we should remove the schema of metadata columns in the commit file for any insert/upsert/... action. > Drop metadata columns before Spark datasource processing > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. 
> Since metadata columns are not used during the Spark datasource processing, we > can drop those columns at the beginning of the Spark datasource path. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Summary: Drop Hudi metadata columns before Spark datasource writing (was: Drop metadata columns before Spark datasource processing ) > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since metadata columns is not used during the Spark datasource processing, we > can drop those columns in the beginning of Spark datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop metadata columns before Spark datasource processing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Summary: Drop metadata columns before Spark datasource processing (was: Remove the schema of metadata columns in the commit files) > Drop metadata columns before Spark datasource processing > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since the schema of metadata columns is always the same, we should remove the > schema of metadata columns in the commit file for any insert/upsert/... > action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1376) Remove the schema of metadata columns in the commit files
Wenning Ding created HUDI-1376: -- Summary: Remove the schema of metadata columns in the commit files Key: HUDI-1376 URL: https://issues.apache.org/jira/browse/HUDI-1376 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When updating a Hudi table through the Spark datasource, the schema of the input dataframe is used as the schema stored in the commit files. Thus, when upserting rows that contain metadata columns, the upsert commit file stores the metadata columns' schema, which is unnecessary for common cases and also causes an issue for bootstrap tables. Since the schema of the metadata columns is always the same, we should remove it from the commit file for any insert/upsert/... action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
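Since the metadata columns are a fixed, well-known set, the handling this issue converged on (dropping them from the input before writing) can be sketched as filtering them out of the incoming schema. The column names below are Hudi's standard metadata columns; the helper itself is hypothetical, not Hudi code:

```python
# Hudi's standard metadata column names (a fixed set prepended to every table).
HOODIE_META_COLUMNS = [
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name",
]

def drop_meta_columns(columns):
    """Return the input schema's columns with Hudi metadata columns removed,
    so the commit file only records the user schema."""
    return [c for c in columns if c not in HOODIE_META_COLUMNS]

# Rows read back from a Hudi table carry metadata columns; strip them
# before the upsert so they never reach the commit-file schema.
cols = ["_hoodie_commit_time", "_hoodie_record_key", "event_id", "event_name"]
print(drop_meta_columns(cols))  # ['event_id', 'event_name']
```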
[jira] [Assigned] (HUDI-1376) Remove the schema of metadata columns in the commit files
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1376: -- Assignee: Wenning Ding > Remove the schema of metadata columns in the commit files > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since the schema of metadata columns is always the same, we should remove the > schema of metadata columns in the commit file for any insert/upsert/... > action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1375) Fix bug for HoodieAvroUtils.removeMetadataFields
Wenning Ding created HUDI-1375: -- Summary: Fix bug for HoodieAvroUtils.removeMetadataFields Key: HUDI-1375 URL: https://issues.apache.org/jira/browse/HUDI-1375 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding There's a bug in HoodieAvroUtils.removeMetadataFields(). Reproduction steps:
{code:java}
import org.apache.hudi.avro.HoodieAvroUtils
import org.apache.avro.Schema

val EXAMPLE_SCHEMA = """{"type": "record","name": "testrec","fields": [ {"name": "timestamp","type": "double"},{"name": "_row_key", "type": "string"},{"name": "non_pii_col", "type": "string"},{"name": "pii_col", "type": "string", "column_category": "user_profile"}]}"""
val schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(EXAMPLE_SCHEMA))
HoodieAvroUtils.removeMetadataFields(schema)
{code}
It then throws an exception:
{code:java}
org.apache.avro.AvroRuntimeException: Field already used: timestamp type:DOUBLE pos:5
  at org.apache.avro.Schema$RecordSchema.setFields(Schema.java:647)
  at org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields(HoodieAvroUtils.java:209)
  ... 49 elided
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
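The "Field already used" error is characteristic of Avro's Schema.Field objects: a field records its position the first time it is added to a record schema and cannot be added to a second one, so the fix is to copy each field rather than reuse it. A toy Python simulation of that semantics (these are not Avro's actual classes):

```python
# Toy model of Avro's one-time field-position rule (not Avro's real classes).
class Field:
    def __init__(self, name):
        self.name = name
        self.pos = -1  # set exactly once, when added to a record

class Record:
    def __init__(self, fields):
        for i, f in enumerate(fields):
            if f.pos != -1:  # mirrors Avro's RecordSchema.setFields check
                raise RuntimeError(f"Field already used: {f.name} pos:{f.pos}")
            f.pos = i
        self.fields = fields

full = Record([Field("_hoodie_commit_time"), Field("timestamp")])
user_fields = [f for f in full.fields if not f.name.startswith("_hoodie_")]

# Buggy approach: reusing the same Field objects in a new record fails.
try:
    Record(user_fields)
except RuntimeError as e:
    print(e)  # Field already used: timestamp pos:1

# Fix: build fresh Field copies for the stripped schema.
stripped = Record([Field(f.name) for f in user_fields])
print([f.name for f in stripped.fields])  # ['timestamp']
```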
[jira] [Assigned] (HUDI-1375) Fix bug for HoodieAvroUtils.removeMetadataFields
[ https://issues.apache.org/jira/browse/HUDI-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1375: -- Assignee: Wenning Ding > Fix bug for HoodieAvroUtils.removeMetadataFields > > > Key: HUDI-1375 > URL: https://issues.apache.org/jira/browse/HUDI-1375 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > There's a bug in HoodieAvroUtils.removeMetadataFields(). > Reproduction steps: > {code:java} > import org.apache.hudi.avro.HoodieAvroUtils > import org.apache.avro.Schema > val EXAMPLE_SCHEMA = """{"type": "record","name": "testrec","fields": [ > {"name": "timestamp","type": "double"},{"name": "_row_key", "type": > "string"},{"name": "non_pii_col", "type": "string"},{"name": "pii_col", > "type": "string", "column_category": "user_profile"}]}""" > val schema = > HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(EXAMPLE_SCHEMA)) > HoodieAvroUtils.removeMetadataFields(schema) > {code} > It then throws an exception: > {code:java} > org.apache.avro.AvroRuntimeException: Field already used: timestamp > type:DOUBLE pos:5 at > org.apache.avro.Schema$RecordSchema.setFields(Schema.java:647) at > org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields(HoodieAvroUtils.java:209) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
[ https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1369: -- Assignee: Wenning Ding > Bootstrap datasource jobs from hanging via spark-submit > --- > > Key: HUDI-1369 > URL: https://issues.apache.org/jira/browse/HUDI-1369 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > MOR table creation via Hudi datasource hangs at the end of the spark-submit > job. > Looks like {{HoodieWriteClient}} at > [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] > not being closed which does not stop the timeline server at the end, and as > a result the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
Wenning Ding created HUDI-1369: -- Summary: Bootstrap datasource jobs from hanging via spark-submit Key: HUDI-1369 URL: https://issues.apache.org/jira/browse/HUDI-1369 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding MOR table creation via the Hudi datasource hangs at the end of the spark-submit job. It looks like the {{HoodieWriteClient}} at [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] is not being closed, so the timeline server is not stopped at the end; as a result the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table
[ https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187969#comment-17187969 ] Wenning Ding commented on HUDI-1021: Verified this issue still exists in 0.6.0 release. Will work on fixing it. > [Bug] Unable to update bootstrapped table using rows from the written > bootstrapped table > > > Key: HUDI-1021 > URL: https://issues.apache.org/jira/browse/HUDI-1021 > Project: Apache Hudi > Issue Type: Sub-task > Components: bootstrap >Reporter: Udit Mehrotra >Assignee: Wenning Ding >Priority: Blocker > Fix For: 0.6.1 > > > Reproduction Steps: > > {code:java} > import spark.implicits._ > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.hudi.HoodieDataSourceHelpers > import org.apache.hudi.common.model.HoodieTableType > import org.apache.spark.sql.SaveMode > val sourcePath = > "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null" > val sourceDf = spark.read.parquet(sourcePath + "/*") > var tableName = "events_data_partitioned_non_null_00" > var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName > sourceDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Overwrite) > .save(tablePath) > val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*") > val updateDf = readDf.filter($"event_id" === "106") > .withColumn("event_name", lit("udit_event_106")) > > 
updateDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Append) > .save(tablePath) > {code} > > Full Stack trace: > {noformat} > Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting > bucketType UPDATE for partition :0 > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:308) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.Result
[jira] [Created] (HUDI-1194) Reorganize HoodieHiveClient and make it fully support Hive Metastore API
Wenning Ding created HUDI-1194: -- Summary: Reorganize HoodieHiveClient and make it fully support Hive Metastore API Key: HUDI-1194 URL: https://issues.apache.org/jira/browse/HUDI-1194 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Currently there are three ways for HoodieHiveClient to perform Hive operations: through Hive JDBC, through the Hive Metastore API, and through the Hive Driver. There's a parameter called +{{hoodie.datasource.hive_sync.use_jdbc}}+ that controls whether to use Hive JDBC. However, this parameter does not accurately describe the actual behavior: when +*use_jdbc*+ is set to true, most of the methods in HoodieHiveClient use JDBC and a few use the Hive Metastore API; when it is set to false, most of the methods use the Hive Driver and a few use the Metastore API. Here is a table showing what is actually used when setting use_jdbc to true/false:
|Method|use_jdbc=true|use_jdbc=false|
|{{addPartitionsToTable}}|JDBC|Hive Driver|
|{{updatePartitionsToTable}}|JDBC|Hive Driver|
|{{scanTablePartitions}}|Metastore API|Metastore API|
|{{updateTableDefinition}}|JDBC|Hive Driver|
|{{createTable}}|JDBC|Hive Driver|
|{{getTableSchema}}|JDBC|Metastore API|
|{{doesTableExist}}|Metastore API|Metastore API|
|{{getLastCommitTimeSynced}}|Metastore API|Metastore API|
[~bschell] and I developed several Metastore API implementations for {{createTable}}, {{addPartitionsToTable}}, {{updatePartitionsToTable}} and {{updateTableDefinition}}, which will be helpful for several issues, e.g. resolving the null-partition Hive sync issue and supporting ALTER TABLE cascade with the AWS Glue catalog. But it seems hard to organize three implementations within the current config, so we plan to separate HoodieHiveClient into three classes:
# HoodieHiveClient, which implements all the APIs through the Metastore API.
# HoodieHiveJDBCClient, which extends HoodieHiveClient and overrides several of the APIs through Hive JDBC.
# HoodieHiveDriverClient, which extends HoodieHiveClient and overrides several of the APIs through the Hive Driver.
And we introduce a new parameter +*hoodie.datasource.hive_sync.hive_client_class*+ which lets you choose which Hive client class to use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
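The proposed three-class split can be sketched as follows. The class names and the config key come from the ticket; the Python method bodies are placeholders standing in for real Metastore/JDBC/Driver calls:

```python
# Sketch of the proposed hierarchy: a metastore-backed base client, with
# JDBC and Driver subclasses overriding only the methods they implement
# differently, selected by a single config key instead of a boolean flag.
class HoodieHiveClient:
    """Base client: every API goes through the Hive Metastore API."""
    def create_table(self, name): return f"metastore:create:{name}"
    def get_table_schema(self, name): return f"metastore:schema:{name}"

class HoodieHiveJDBCClient(HoodieHiveClient):
    def create_table(self, name): return f"jdbc:create:{name}"  # DDL over JDBC

class HoodieHiveDriverClient(HoodieHiveClient):
    def create_table(self, name): return f"driver:create:{name}"  # DDL via Hive Driver

CLIENTS = {
    "metastore": HoodieHiveClient,
    "jdbc": HoodieHiveJDBCClient,
    "driver": HoodieHiveDriverClient,
}

def make_client(config):
    # Replaces the ambiguous boolean hoodie.datasource.hive_sync.use_jdbc:
    # one key picks the client class, and inheritance covers the methods
    # that always go through the Metastore API.
    key = config.get("hoodie.datasource.hive_sync.hive_client_class", "metastore")
    return CLIENTS[key]()

c = make_client({"hoodie.datasource.hive_sync.hive_client_class": "jdbc"})
print(c.create_table("t"))      # jdbc:create:t
print(c.get_table_schema("t"))  # metastore:schema:t (inherited from the base)
```

The design point is that the table above collapses into the class hierarchy: rows that say "Metastore API" in both columns live only in the base class, and only the rows that differ are overridden.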
[jira] [Assigned] (HUDI-1181) Decimal type display issue for record key field
[ https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1181: -- Assignee: Wenning Ding > Decimal type display issue for record key field > --- > > Key: HUDI-1181 > URL: https://issues.apache.org/jira/browse/HUDI-1181 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would > not correctly display the decimal value, instead, Hudi would display it as a > byte array. > During the Hudi writing phase, Hudi would save the parquet source data into > Avro Generic Record. For example, the source parquet data has a column with > decimal type: > > {code:java} > optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} > > Then Hudi will convert it into the following avro decimal type: > {code:java} > { > "name" : "OBJ_ID", > "type" : [ { > "type" : "fixed", > "name" : "fixed", > "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", > "size" : 16, > "logicalType" : "decimal", > "precision" : 38, > "scale" : 0 > }, "null" ] > } > {code} > This decimal field would be stored as a fixed length bytes array. And in the > reading phase, Hudi will convert this bytes array back to a readable decimal > value through this > [converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58]. > However, the problem is, when setting decimal type as record keys, Hudi would > read the value from Avro Generic Record and then directly convert it into > String type(See > [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). 
> As a result, what shows in the _hoodie_record_key field would be something > like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. > So we need to handle this special case and convert the byte array back before > converting to String. -- This message was sent by Atlassian Jira (v8.3.4#803005)
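The needed conversion is to interpret the fixed-length byte array as a big-endian two's-complement unscaled integer and apply the decimal scale before stringifying, instead of printing the raw bytes. A sketch (the helper name is hypothetical; the byte values are the Java signed bytes from the ticket, with DECIMAL(38,0)):

```python
def fixed_bytes_to_decimal(raw, scale):
    """Decode an Avro 'fixed' decimal: big-endian two's-complement bytes,
    scaled by 10**scale. Accepts Java-style signed byte values."""
    unscaled = int.from_bytes(bytes(b & 0xFF for b in raw), "big", signed=True)
    if scale == 0:
        return str(unscaled)
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    sign = "-" if unscaled < 0 else ""
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"

# The byte array shown in the record key above, as Java signed bytes:
java_bytes = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]
print(fixed_bytes_to_decimal(java_bytes, 0))  # 422076345
```

With this decoding step applied before the record key is stringified, the key would read LN_LQDN_OBJ_ID:422076345 rather than the raw byte array.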
[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field
[ https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1181: --- Description: When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. For example, the source parquet data has a column with decimal type: {code:java} optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} Then Hudi will convert it into the following avro decimal type: {code:java} { "name" : "OBJ_ID", "type" : [ { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", "size" : 16, "logicalType" : "decimal", "precision" : 38, "scale" : 0 }, "null" ] } {code} This decimal field would be stored as a fixed length bytes array. And in the reading phase, Hudi will convert this bytes array back to a readable decimal value through this converter. However, the problem is, when setting decimal type as record keys, Hudi would read the value from Avro Generic Record and then directly convert it into String type(See [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). As a result, what shows in the _hoodie_record_key field would be something like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71].So we need to handle this special case to convert bytes array back before converting to String. was: When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. 
For example, the source parquet data has a column with decimal type: {code:java} optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} Then Hudi will convert it into the following avro decimal type: {code:java} { "name" : "LN_LQDN_OBJ_ID", "type" : [ { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", "size" : 16, "logicalType" : "decimal", "precision" : 38, "scale" : 0 }, "null" ] } {code} This decimal field would be stored as a fixed length bytes array. And in the reading phase, Hudi will convert this bytes array back to a readable decimal value through this converter. However, the problem is, when setting decimal type as record keys, Hudi would read the value from Avro Generic Record and then directly convert it into String type(See [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). As a result, what shows in the _hoodie_record_key field would be something like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71].So we need to handle this special case to convert bytes array back before converting to String. > Decimal type display issue for record key field > --- > > Key: HUDI-1181 > URL: https://issues.apache.org/jira/browse/HUDI-1181 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Priority: Major > > When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would > not correctly display the decimal value, instead, Hudi would display it as a > byte array. > During the Hudi writing phase, Hudi would save the parquet source data into > Avro Generic Record. 
For example, the source parquet data has a column with > decimal type: > > {code:java} > optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} > > Then Hudi will convert it into the following avro decimal type: > {code:java} > { > "name" : "OBJ_ID", > "type" : [ { > "type" : "fixed", > "name" : "fixed", > "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", > "size" : 16, > "logicalType" : "decimal", > "precision" : 38, > "scale" : 0 > }, "null" ] > } > {code} > This decimal field would be stored as a fixed length bytes array. And in the > reading phase, Hudi will convert this bytes array back to a readable decimal > value through this converter. > However, the problem is, when setting decimal type as record keys, Hudi would > read the value from Avro Generic Record and then directly convert it into > String type(See > [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). > As a result, what shows in the _hoodie_record_key field would be something > like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71
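The conversion the issue asks for can be sketched in plain Java: a fixed-length two's-complement byte array carrying an Avro decimal can be turned back into a readable value with BigInteger/BigDecimal before it is stringified into the record key. This is only an illustration of the encoding, not Hudi's actual code path; the class and method names below are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalKeySketch {
    // Interpret a fixed-length two's-complement byte array as a decimal
    // with the given scale, the layout Avro's decimal logical type uses
    // for "fixed" fields.
    static BigDecimal fromFixedBytes(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        // The byte array from the report; the column is DECIMAL(38,0), so scale = 0.
        byte[] raw = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71};
        // Prints a readable decimal instead of the raw "[0, 0, ...]" array form.
        System.out.println(fromFixedBytes(raw, 0));
    }
}
```

Applying a conversion like this before the String conversion in DataSourceUtils would make _hoodie_record_key show the numeric value rather than the byte array.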
[jira] [Created] (HUDI-1181) Decimal type display issue for record key field
Wenning Ding created HUDI-1181: -- Summary: Decimal type display issue for record key field Key: HUDI-1181 URL: https://issues.apache.org/jira/browse/HUDI-1181 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When using ```fixed_len_byte_array``` decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. For example, the source parquet data has a column with decimal type: ``` optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0)); ``` Then Hudi will convert it into the following avro decimal type: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1180) Upgrade HBase to 2.3.3
Wenning Ding created HUDI-1180: -- Summary: Upgrade HBase to 2.3.3 Key: HUDI-1180 URL: https://issues.apache.org/jira/browse/HUDI-1180 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Trying to upgrade HBase to 2.3.3, but ran into several issues. According to the Hadoop version support matrix ([http://hbase.apache.org/book.html#hadoop]), we also need to upgrade Hadoop to 2.8.5+. There are several API conflicts between HBase 2.3.3 and HBase 1.2.3; we need to resolve these first. After resolving the conflicts I am able to compile, but then I ran into a tricky Jetty version issue during testing: {code:java} [ERROR] TestHBaseIndex.testDelete() Time elapsed: 4.705 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate() Time elapsed: 0.174 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback() Time elapsed: 0.076 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSmallBatchSize() Time elapsed: 0.122 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate() Time elapsed: 0.16 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTotalGetsBatching() Time elapsed: 1.771 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTotalPutsBatching() Time elapsed: 0.082 s <<< ERROR! 
java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V 34206 [Thread-260] WARN org.apache.hadoop.hdfs.server.datanode.DirectoryScanner - DirectoryScanner: shutdown has been called 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to localhost/127.0.0.1:55924] WARN org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager - IncrementalBlockReportManager interrupted 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to localhost/127.0.0.1:55924] WARN org.apache.hadoop.hdfs.server.datanode.DataNode - Ending block pool service for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924 34246 [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506] WARN org.apache.hadoop.fs.CachingGetSpaceUsed - Thread Interrupted waiting to refresh disk information: sleep interrupted 34247 [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506] WARN org.apache.hadoop.fs.CachingGetSpaceUsed - Thread Interrupted waiting to refresh disk information: sleep interrupted 37192 [HBase-Metrics2-1] WARN org.apache.hadoop.metrics2.impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-datanode.properties,hadoop-metrics2.properties 43904 [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] WARN org.apache.zookeeper.ClientCnxn - Session 0x173dfeb0c8b0004 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [INFO] [INFO] Results: [INFO] [ERROR] Errors: [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [INFO] [ERROR] Tests run: 10, Failures: 0, Errors: 7, Skipped: 0 [INFO] {code} Basically, Hudi and its dependency Javalin currently depend on Jetty 9.4.x, while HBase depends on Jetty 9.3.x, and the two have incompatible APIs that cannot be easily reconciled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
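A common way to resolve this kind of split is to exclude the conflicting Jetty artifacts from the HBase dependency and let a single pinned version win, or to shade/relocate Jetty entirely. A hedged pom.xml sketch of the exclusion approach follows; the choice of hbase-server as the artifact and the wildcard exclusion are illustrative assumptions, and the actual set of transitive Jetty modules would need to be confirmed with `mvn dependency:tree`:

```xml
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>2.3.3</version>
  <exclusions>
    <!-- Drop HBase's Jetty 9.3.x so the project-wide 9.4.x is used. -->
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Exclusion only works if HBase does not invoke the removed 9.3.x-specific APIs at runtime; if it does, shading/relocating one of the Jetty copies is the safer route.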
[jira] [Assigned] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table
[ https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1021: -- Assignee: Wenning Ding (was: Balaji Varadarajan) > [Bug] Unable to update bootstrapped table using rows from the written > bootstrapped table > > > Key: HUDI-1021 > URL: https://issues.apache.org/jira/browse/HUDI-1021 > Project: Apache Hudi > Issue Type: Sub-task > Components: bootstrap >Reporter: Udit Mehrotra >Assignee: Wenning Ding >Priority: Blocker > Fix For: 0.6.0 > > > Reproduction Steps: > > {code:java} > import spark.implicits._ > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.hudi.HoodieDataSourceHelpers > import org.apache.hudi.common.model.HoodieTableType > import org.apache.spark.sql.SaveMode > val sourcePath = > "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null" > val sourceDf = spark.read.parquet(sourcePath + "/*") > var tableName = "events_data_partitioned_non_null_00" > var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName > sourceDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Overwrite) > .save(tablePath) > val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*") > val updateDf = readDf.filter($"event_id" === "106") > .withColumn("event_name", lit("udit_event_106")) > > updateDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, 
tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Append) > .save(tablePath) > {code} > > Full Stack trace: > {noformat} > Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting > bucketType UPDATE for partition :0 > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155) > at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:308) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.
[jira] [Created] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue
Wenning Ding created HUDI-1061: -- Summary: Hudi CLI savepoint command fail because of spark conf loading issue Key: HUDI-1061 URL: https://issues.apache.org/jira/browse/HUDI-1061 Project: Apache Hudi Issue Type: Bug Components: CLI Reporter: Wenning Ding h3. Reproduce Open hudi-cli.sh and run these two commands: {code:java} connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01 savepoint create --commit 2019115109 {code} You would see this error: {code:java} java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97) at org.apache.spark.SparkContext.(SparkContext.scala:523) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) at org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68) at org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59) at org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134) at 
org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at java.lang.Thread.run(Thread.java:748){code} Although {{spark-defaults.conf}} configures {{spark.eventLog.dir hdfs:///var/log/spark/apps}}, the Hudi CLI still uses {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext won't load the configs from {{spark-defaults.conf}}. We should make the initJavaSparkConf method able to read configs from the Spark config file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
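The parsing side of the proposed fix is straightforward: spark-defaults.conf is a list of whitespace-separated key/value lines with '#' comments. A minimal, Spark-free sketch (the class name is hypothetical; the real fix would feed these entries into the SparkConf before the JavaSparkContext is constructed):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparkDefaultsSketch {
    // Parse spark-defaults.conf style lines ("key<whitespace>value"),
    // skipping blank lines and '#' comments.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            // Split on the first run of whitespace; the rest is the value.
            String[] kv = trimmed.split("\\s+", 2);
            if (kv.length == 2) conf.put(kv[0], kv[1].trim());
        }
        return conf;
    }
}
```

With entries loaded this way, {{spark.eventLog.dir}} would resolve to {{hdfs:///var/log/spark/apps}} instead of the {{file:/tmp/spark-events}} default that triggers the FileNotFoundException above.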
[jira] [Resolved] (HUDI-949) Test MOR : Hive Realtime Query with metadata bootstrap
[ https://issues.apache.org/jira/browse/HUDI-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-949. --- Resolution: Fixed > Test MOR : Hive Realtime Query with metadata bootstrap > -- > > Key: HUDI-949 > URL: https://issues.apache.org/jira/browse/HUDI-949 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Balaji Varadarajan >Assignee: Wenning Ding >Priority: Major > Time Spent: 72h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-950) Test COW : Spark SQL Read Optimized Query with metadata bootstrap
[ https://issues.apache.org/jira/browse/HUDI-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-950. --- Resolution: Fixed > Test COW : Spark SQL Read Optimized Query with metadata bootstrap > - > > Key: HUDI-950 > URL: https://issues.apache.org/jira/browse/HUDI-950 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: Balaji Varadarajan >Assignee: Wenning Ding >Priority: Major > Time Spent: 72h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name
Wenning Ding created HUDI-971: - Summary: Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name Key: HUDI-971 URL: https://issues.apache.org/jira/browse/HUDI-971 Project: Apache Hudi Issue Type: Sub-task Reporter: Wenning Ding When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return unclean partition names because of [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
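The linked CellUtil code hands back the cell's shared backing array rather than just the key's bytes, so callers can see surrounding data from neighboring cells. The defensive fix is to copy exactly the (offset, length) slice before turning it into a partition name. A minimal sketch, assuming the offset and length are available from the cell (names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CleanKeyCopySketch {
    // Copy only the [offset, offset + length) slice of the shared backing
    // array, so bytes from adjacent cells never leak into the name.
    static String cleanPartitionName(byte[] backing, int offset, int length) {
        byte[] slice = Arrays.copyOfRange(backing, offset, offset + length);
        return new String(slice, StandardCharsets.UTF_8);
    }
}
```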
[jira] [Created] (HUDI-934) Hive query does not work with realtime table which contain decimal type
Wenning Ding created HUDI-934: - Summary: Hive query does not work with realtime table which contain decimal type Key: HUDI-934 URL: https://issues.apache.org/jira/browse/HUDI-934 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding h3. Issue After updating a MOR table that contains a decimal type, a Hive query would fail because of a type cast exception. The bug appears to be that the decimal type is not correctly handled here: https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L303 h3. Reproduction steps Create a MOR table with decimal type {code:java} import org.apache.spark.sql.types._ import spark.implicits._ import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.common.model.HoodieTableType import org.apache.hudi.config.HoodieWriteConfig import org.apache.spark.sql.SaveMode import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType} import org.apache.spark.sql.Row import org.apache.spark.sql.functions import java.util.Date import org.apache.spark.sql.DataFrame var df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", 1.32, "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", 2.57, "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", 3.45, "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", 6.78, "type2") ).toDF("event_id", "event_name", "event_ts", "event_value", "event_type") df = df.withColumn("event_value", df.col("event_value").cast(DecimalType(4,2))) val tableType = HoodieTableType.MERGE_ON_READ.name val tableName = "test8" val tablePath = "s3://xxx/hudi/tables/" + tableName + "/" df.write.format("org.apache.hudi") .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option("hoodie.compact.inline", "false") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default") .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator") .mode(SaveMode.Overwrite) .save(tablePath) {code} Update this table w/o inline compaction: {code:java} var update_df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", 9.00, "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", 2.57, "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", 8.00, "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", 6.78, "type2") ).toDF("event_id", "event_name", "event_ts", "event_value", "event_type") update_df = update_df.withColumn("event_value", update_df.col("event_value").cast(DecimalType(4,2))) update_df.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option("hoodie.compact.inline", "false") .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") 
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default") .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator") .mode(SaveMode.Append) .save(tablePath) {code} Query _rt table with hive {code:java} hive> select * from test8_rt; {code} Get this error {code:java} Failed with
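For context on what a correct reader must do at the linked line: an Avro decimal logical type carries the unscaled value as two's-complement big-endian bytes, while the scale lives only in the schema. A minimal, generic sketch of that reconstruction (not the actual Hudi/Hive conversion code; class and method names are illustrative):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class AvroDecimalSketch {
    // Rebuild a BigDecimal from an Avro decimal payload: the bytes are the
    // two's-complement big-endian unscaled value; the scale must be read
    // from the Avro schema, not from the bytes themselves.
    static BigDecimal decode(byte[] unscaledBytes, int scale) {
        return new BigDecimal(new BigInteger(unscaledBytes), scale);
    }
}
```

Dropping or misreading the schema scale at this step is exactly the kind of mismatch that can surface later as a Hive-side cast exception.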
[jira] [Updated] (HUDI-713) Datasource Writer throws error on resolving array of struct fields
[ https://issues.apache.org/jira/browse/HUDI-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-713:
--
Description:
Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With the migration of Hudi to Spark 2.4.4 and its use of Spark's native spark-avro module, this issue now exists in Hudi master.

Reproduction steps: run the following script
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val sample = """
[{
  "partition": 0, "offset": 5, "timestamp": "1581508884",
  "value": { "prop1": "val1", "prop2": [{"withinProp1": "val2", "withinProp2": 1}] }
}, {
  "partition": 1, "offset": 10, "timestamp": "1581108884",
  "value": { "prop1": "val4", "prop2": [{"withinProp1": "val5", "withinProp2": 2}] }
}]
"""

val df = spark.read.option("dropFieldIfAllNull", "true").json(Seq(sample).toDS)
val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
val dfcol2 = dfcol1.withColumn("year_partition", year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), $"offset"))
val dfcol3 = dfcol2.drop("timestamp")

val hudiOptions: Map[String, String] = Map[String, String](
  DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 1024),
  "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
  "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
)

dfcol3.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "year_partition")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year_partition")
  .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
  .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
{code}

This throws a "not in union" exception:
{code:java}
Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]: {"withinProp1": "val2", "withinProp2": 1}
{code}
[jira] [Created] (HUDI-713) Datasource Writer throws error on resolving array of struct fields
Wenning Ding created HUDI-713:
-
Summary: Datasource Writer throws error on resolving array of struct fields
Key: HUDI-713
URL: https://issues.apache.org/jira/browse/HUDI-713
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Wenning Ding

Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With the migration of Hudi to Spark 2.4.4 and its use of Spark's native spark-avro module, this issue now exists in Hudi master.

Reproduction steps: run the following script
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val sample = """
[{
  "partition": 0, "offset": 5, "timestamp": "1581508884",
  "value": { "prop1": "val1", "prop2": [{"withinProp1": "val2", "withinProp2": 1}] }
}, {
  "partition": 1, "offset": 10, "timestamp": "1581108884",
  "value": { "prop1": "val4", "prop2": [{"withinProp1": "val5", "withinProp2": 2}] }
}]
"""

val df = spark.read.option("dropFieldIfAllNull", "true").json(Seq(sample).toDS)
val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
val dfcol2 = dfcol1.withColumn("year_partition", year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), $"offset"))
val dfcol3 = dfcol2.drop("timestamp")

val hudiOptions: Map[String, String] = Map[String, String](
  DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 1024),
  "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
  "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
)

dfcol3.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "year_partition")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year_partition")
  .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
  .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
{code}

This throws a "not in union" exception:
{code:java}
Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]: {"withinProp1": "val2", "withinProp2": 1}
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027742#comment-17027742 ]

Wenning Ding commented on HUDI-515:
---
[~nishith29] Thanks for your advice. I realized that just copying the code may not be a good option. Using Java reflection to make it compatible with Hive 2 & 3 looks like a better way.

> Resolve API conflict for Hive 2 and Hive 3
> --
>
> Key: HUDI-515
> URL: https://issues.apache.org/jira/browse/HUDI-515
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Reporter: Wenning Ding
> Priority: Major
>
> Currently I am working on supporting HIVE 3. There is an API issue.
> In *HoodieCombineHiveInputFormat.java*, it calls a Hive method:
> HiveFileFormatUtils.getPartitionDescFromPathRecursively(). But this method is
> removed in Hive 3 and replaced by
> HiveFileFormatUtils.getFromPathRecursively().
> Ideally, Hudi should support both Hive 2 & Hive 3 so that both
> {code:java}
> mvn clean install{code}
> and
> {code:java}
> mvn clean install -Dhive.version=3.x{code}
> could work.
> One solution is to directly copy source code from
> [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java]
> and put that code inside *HoodieCombineHiveInputFormat.java*.
> The other way is using java reflection to decide which method to use.
> (preferred)
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*. The other way is to use Java reflection to decide which method to use (preferred).
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*. The other way is to use Java reflection to decide which method to use (preferred).
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*.
[jira] [Created] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
Wenning Ding created HUDI-515:
-
Summary: Resolve API conflict for Hive 2 and Hive 3
Key: HUDI-515
URL: https://issues.apache.org/jira/browse/HUDI-515
Project: Apache Hudi (incubating)
Issue Type: Sub-task
Reporter: Wenning Ding

Currently I am working on supporting Hive 3. There is an API issue.

In [HoodieCombineHiveInputFormat.java|https://code.amazon.com/packages/Aws157Hudi/blobs/a6517d4bfb2584a983342297d350bc9f6d0f38db/--/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java], it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into [HoodieCombineHiveInputFormat.java|https://code.amazon.com/packages/Aws157Hudi/blobs/a6517d4bfb2584a983342297d350bc9f6d0f38db/--/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java].
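The preferred reflection approach can be sketched generically. The helper below is hypothetical (not actual Hudi code): it resolves the first method name that exists on a class, so the Hive 3 name can be tried before the Hive 2 one, and the same jar works against either Hive version.

```java
import java.lang.reflect.Method;

public class HiveCompatSketch {
    // Return the first public method on cls whose name matches one of the
    // candidates, e.g. "getFromPathRecursively" (Hive 3) before
    // "getPartitionDescFromPathRecursively" (Hive 2).
    static Method resolveFirst(Class<?> cls, String... candidates) {
        for (String name : candidates) {
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(name)) {
                    return m;
                }
            }
        }
        throw new IllegalStateException("No compatible method on " + cls.getName());
    }
}
```

In practice the resolved Method would be cached once at initialization and reused via invoke(), so the reflection lookup cost is paid only once.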
[jira] [Created] (HUDI-495) Update deprecated HBase API
Wenning Ding created HUDI-495:
-
Summary: Update deprecated HBase API
Key: HUDI-495
URL: https://issues.apache.org/jira/browse/HUDI-495
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Wenning Ding

Internally we are using HBase 2.x, and HBase 2.x no longer supports _*HTable.flushCommits()*_; it is replaced by _*BufferedMutator.flush()*_. Thus, for the put and delete operations we can use BufferedMutator instead.
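A sketch of the replacement, assuming the standard HBase 2.x client API; the connection setup, class name, and table/column arguments here are illustrative, not the actual Hudi index code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;

public class BufferedMutatorSketch {
    // HBase 2.x: buffer mutations through a BufferedMutator and call flush(),
    // which takes over the role of the removed HTable.flushCommits().
    static void putAndFlush(Configuration conf, String table, byte[] row,
                            byte[] family, byte[] qualifier, byte[] value) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf(table))) {
            Put put = new Put(row);
            put.addColumn(family, qualifier, value);
            mutator.mutate(put); // Delete mutations go through the same mutate() call
            mutator.flush();     // replaces HTable.flushCommits()
        }
    }
}
```

Because BufferedMutator batches on the client side, explicit flush() calls are only needed where Hudi previously relied on flushCommits() for durability before reading back.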
[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing
[ https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998480#comment-16998480 ]

Wenning Ding commented on HUDI-259:
---
Hey [~Pratyaksh], I am also working on Hadoop 3 support for Hudi. After moving to Hadoop 3.x and Hive 3.x, the unit tests for the hudi-hive module fail when they try to start the Hive metastore and HiveServer2. Are you facing the same issue?

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Usability
> Reporter: Vinoth Chandar
> Assignee: Pratyaksh Sharma
> Priority: Major
>
> Sample issues
>
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568]
> [https://github.com/apache/incubator-hudi/issues/898]
[jira] [Created] (HUDI-353) Add support for Hive style partitioning path
Wenning Ding created HUDI-353:
-
Summary: Add support for Hive style partitioning path
Key: HUDI-353
URL: https://issues.apache.org/jira/browse/HUDI-353
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Wenning Ding

In Hive, a partition folder name follows the format {{<partition_column>=<partition_value>}}, but in Hudi the partition folder name is just {{<partition_value>}}.

E.g. a dataset is partitioned by three columns: year, month and day. In Hive, the data is saved in:
{{.../year=2019/month=05/day=01/xxx.parquet}}
In Hudi, the data is saved in:
{{.../2019/05/01/xxx.parquet}}

Basically I add a new option in the Spark datasource named {{HIVE_STYLE_PARTITIONING_OPT_KEY}} which indicates whether to use Hive-style partitioning. By default this option is false (disabled). Also, when Hive-style partitioning is used, instead of scanning the dataset and manually adding/updating all partitions, we can use "MSCK REPAIR TABLE <table_name>" to automatically sync all the partition info with the Hive MetaStore.
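The difference between the two layouts can be sketched with a small helper. This is purely illustrative (Hudi's actual key generators are more involved and handle encoding, defaults, and nested fields):

```java
public class PartitionPathSketch {
    // Build a partition path from column names and values: Hive-style emits
    // "column=value" segments, plain style emits just the values.
    static String partitionPath(boolean hiveStyle, String[] columns, String[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) sb.append('/');
            if (hiveStyle) sb.append(columns[i]).append('=');
            sb.append(values[i]);
        }
        return sb.toString();
    }
}
```

With the Hive-style layout, Hive can parse the partition columns straight out of the folder names, which is what makes "MSCK REPAIR TABLE" able to discover partitions automatically.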
[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
[ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-325:
--
Description:

h3. Description

While doing internal testing in EMR, we found that if the Hudi table path follows the format hdfs:///user/... or hdfs:/user/..., then the Hudi table cannot be queried by Hive after an update.

h3. Reproduction

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}

Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}

It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputI
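The mismatch behind this error is that a fully qualified split path (hdfs://host:port/...) is looked up against a pathToPartitionInfo key written as hdfs:/... . The two spellings name the same filesystem path but compare unequal as strings, which plain java.net.URI makes visible; this is only an illustration of the spellings, not Hive's or Hudi's actual lookup code:

```java
import java.net.URI;

public class HdfsPathSketch {
    // Both spellings resolve to the same scheme and path component, even
    // though their raw string forms differ.
    static boolean samePath(String a, String b) {
        URI ua = URI.create(a);
        URI ub = URI.create(b);
        return ua.getScheme().equals(ub.getScheme()) && ua.getPath().equals(ub.getPath());
    }
}
```

A fix therefore needs to normalize (fully qualify) table paths before they are used as lookup keys, rather than comparing raw strings.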
[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
[ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-325: -- Description: h3. Description While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., the Hudi table can no longer be queried by Hive after an update. h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputI
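The exception above boils down to a string-keyed lookup: the pathToPartitionInfo map holds the partition directory in its scheme-only form (hdfs:/user/...), while the file split carries the fully qualified form (hdfs://host:8020/user/...). A minimal sketch of the mismatch, using plain java.net.URI and hypothetical names (this is not the actual Hive code):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PathKeyMismatch {
    public static void main(String[] args) {
        // Hypothetical stand-in for Hive's pathToPartitionInfo, keyed by the
        // scheme-only path string that the table was registered with.
        Map<String, String> pathToPartitionInfo = new HashMap<>();
        pathToPartitionInfo.put("hdfs:/user/hadoop/hudi_test/type1", "partition=type1");

        // The split reports a fully qualified path, so a plain string lookup misses.
        String splitDir = "hdfs://namenode:8020/user/hadoop/hudi_test/type1";
        System.out.println(pathToPartitionInfo.containsKey(splitDir)); // false

        // Comparing only the path component makes the two forms agree again.
        String a = URI.create(splitDir).getPath();
        String b = URI.create("hdfs:/user/hadoop/hudi_test/type1").getPath();
        System.out.println(a.equals(b)); // true
    }
}
```

The sketch only shows why the two raw strings disagree; a real fix would normalize both sides to the same fully qualified form before the lookup.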
[jira] [Created] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
Wenning Ding created HUDI-325: - Summary: Unable to query by Hive after updating HDFS Hudi table Key: HUDI-325 URL: https://issues.apache.org/jira/browse/HUDI-325 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding h3. Description While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., the Hudi table can no longer be queried by Hive after an update. h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apac
[jira] [Resolved] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-313. --- Resolution: Fixed
> Unable to SELECT COUNT(*) from a MOR realtime table
> ---
>
> Key: HUDI-313
> URL: https://issues.apache.org/jira/browse/HUDI-313
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When I run a query like this in Hive:
> {code:java}
> SELECT COUNT(*) FROM hudi_test_rt;
> OR:
> SELECT COUNT(1) FROM hudi_test_rt;{code}
> It returns:
> {code:java}
> 2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
> at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
> ... 19 more
> Caused by: java.lang.NumberFormatException: For input string: ""
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:592)
> at java.lang.Integer.parseInt(Integer.java:615)
> at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
> at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
> at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
> ... 20 more
> {code}
> Some investigations:
> Basically, Hive tries to update the projection column IDs during each getRecordReader stage. But for CO
[jira] [Resolved] (HUDI-314) Unable to query a multi-partitions MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-314. --- Resolution: Fixed
> Unable to query a multi-partitions MOR realtime table
> -
>
> Key: HUDI-314
> URL: https://issues.apache.org/jira/browse/HUDI-314
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> h3. Description
> I created a Hudi MOR table with multiple partition keys. The partition keys are "year", "month" and "day".
> When I try to query its realtime table in Hive like this:
> {code:java}
> SELECT * FROM hudi_multi_partitions_test_rt;
> {code}
> It returns:
> {code:java}
> java.lang.Exception: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> Caused by: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
> Caused by: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.avro.Schema.validateName(Schema.java:1083) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema.access$200(Schema.java:79) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema$Field.<init>(Schema.java:372) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema$Field.<init>(Schema.java:367) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.hudi.common.util.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:166) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.addPartitionFields(AbstractRealtimeRecordReader.java:305) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:328) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:103) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:233) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.j
[jira] [Created] (HUDI-314) Unable to query a multi-partitions MOR realtime table
Wenning Ding created HUDI-314: - Summary: Unable to query a multi-partitions MOR realtime table Key: HUDI-314 URL: https://issues.apache.org/jira/browse/HUDI-314 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding h3. Description I created a Hudi MOR table with multiple partition keys. The partition keys are "year", "month" and "day". When I try to query its realtime table in Hive like this:
{code:java}
SELECT * FROM hudi_multi_partitions_test_rt;
{code}
It returns:
{code:java}
java.lang.Exception: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
Caused by: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.avro.Schema.validateName(Schema.java:1083) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema.access$200(Schema.java:79) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:372) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:367) ~[avro-1.7.7.jar:1.7.7]
at org.apache.hudi.common.util.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:166) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.addPartitionFields(AbstractRealtimeRecordReader.java:305) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:328) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:103) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:233) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:51
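The root cause visible in the trace is that the combined partition-field string "year/month/day" reaches Avro as a single field name, and Avro's name validation rejects the slashes. The sketch below illustrates that rule and why splitting the configured string first avoids the error; the validator is an approximation of Avro's [A-Za-z_][A-Za-z0-9_]* rule, not the Avro implementation itself:

```java
public class PartitionFieldNames {
    // Approximation of Avro's field-name rule: first char must be a letter
    // or '_', the remaining chars letters, digits, or '_'.
    static boolean isValidAvroName(String name) {
        if (name.isEmpty()) return false;
        char first = name.charAt(0);
        if (!(Character.isLetter(first) || first == '_')) return false;
        for (char c : name.toCharArray()) {
            if (!(Character.isLetterOrDigit(c) || c == '_')) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String configured = "year/month/day";
        // Passing the whole string as one field name fails, as in the stack trace.
        System.out.println(isValidAvroName(configured)); // false
        // Splitting on '/' yields three individually valid field names.
        for (String field : configured.split("/")) {
            System.out.println(field + " -> " + isValidAvroName(field)); // all true
        }
    }
}
```

This is why a multi-partition table trips the check while a single-partition table does not: one field name like "event_type" passes validation on its own.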
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-313: -- Description: When I run a query like this in Hive:
{code:java}
SELECT COUNT(*) FROM hudi_test_rt;
OR:
SELECT COUNT(1) FROM hudi_test_rt;{code}
It returns:
{code:java}
2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
 at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
 at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
 ... 19 more
Caused by: java.lang.NumberFormatException: For input string: ""
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:592)
 at java.lang.Integer.parseInt(Integer.java:615)
 at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
 at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
 at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
 ... 20 more
{code}
Some investigations: Hive tries to update the projection column IDs at each getRecordReader stage. For COUNT(*) or COUNT(1), no projection columns are needed, so the column-ID string is empty. Meanwhile, to support compaction on MOR tables, Hudi manually adds its three required columns to the projection column IDs, producing a string like "2,0,3". When Hive then updates the projection column IDs, it concatenates the empty string with Hudi's required column IDs and ends up with ",2,0,3". The leading comma causes the error during parsing. One possible solution is to add a method that checks whether the projection column IDs start with a comma and, if so, removes it.
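The fix idea described above can be sketched as a small standalone helper. This is only an illustration of the approach, not Hudi's actual patch; the class and method names (`ProjectionColumnIdsFix`, `stripLeadingComma`) are hypothetical:

```java
public class ProjectionColumnIdsFix {

    // COUNT(*) / COUNT(1) leave Hive's projection column-ID string empty.
    // Prepending Hudi's required columns ("2,0,3") to "" yields ",2,0,3",
    // and the empty token before the first comma makes
    // Integer.parseInt("") throw NumberFormatException.
    // Hypothetical helper: drop a leading comma before the IDs are parsed.
    static String stripLeadingComma(String columnIds) {
        if (columnIds != null && columnIds.startsWith(",")) {
            return columnIds.substring(1);
        }
        return columnIds;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingComma(",2,0,3")); // the failing case
        System.out.println(stripLeadingComma("2,0,3"));  // already valid: unchanged
        System.out.println(stripLeadingComma(""));       // empty stays empty
    }
}
```

Such a check would run wherever Hudi merges its required column IDs into Hive's, before `ColumnProjectionUtils.getReadColumnIDs` parses the string.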
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[jira] [Updated] (HUDI-313) Unable SELECT COUNT(*) from a MOR realtime table
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-313: -- Summary: Unable to SELECT COUNT(*) from a MOR realtime table (was: Unable SELECT COUNT(*) from a MOR realtime table) > Unable to SELECT COUNT(*) from a MOR realtime table > --- > > Key: HUDI-313 > URL: https://issues.apache.org/jira/browse/HUDI-313 > Project: Apache Hudi (incubating) > Issue Type: Bug > Reporter: Wenning Ding > Priority: Major
[jira] [Created] (HUDI-313) Unable SELECT COUNT(*) from a MOR realtime table
Wenning Ding created HUDI-313: - Summary: Unable SELECT COUNT(*) from a MOR realtime table Key: HUDI-313 URL: https://issues.apache.org/jira/browse/HUDI-313 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding
[jira] [Created] (HUDI-301) Failed to update a non-partition MOR table
Wenning Ding created HUDI-301: - Summary: Failed to update a non-partition MOR table Key: HUDI-301 URL: https://issues.apache.org/jira/browse/HUDI-301 Project: Apache Hudi (incubating) Issue Type: Bug Components: Common Core Reporter: Wenning Ding We hit this exception when trying to update a field of a non-partitioned MOR table:
{code:java}
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
 at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:273)
 at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:457)
 at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
 at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
 at org.apache.hadoop.fs.Path.<init>(Path.java:175)
 at org.apache.hadoop.fs.Path.<init>(Path.java:110)
 at org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:145)
 at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:194)
 at org.apache.hudi.table.HoodieMergeOnReadTable.handleUpdate(HoodieMergeOnReadTable.java:116)
 at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:265)
 ... 30 more
{code}
I have created a PR to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
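The "Can not create a Path from an empty string" failure arises because a non-partitioned table has an empty relative partition path, and that empty string is eventually passed to Hadoop's `Path` constructor. A minimal sketch of the kind of guard such a fix needs, using plain strings instead of Hadoop classes; the helper name `partitionDir` is illustrative, not Hudi's actual code:

```java
public class PartitionPathJoin {

    // For a non-partitioned table the relative partition path is "",
    // and constructing a Path from "" throws IllegalArgumentException.
    // Illustrative guard: fall back to the base path when the
    // partition path is empty, instead of joining the empty string.
    static String partitionDir(String basePath, String partitionPath) {
        if (partitionPath == null || partitionPath.isEmpty()) {
            return basePath; // non-partitioned: write directly under the base path
        }
        return basePath + "/" + partitionPath;
    }

    public static void main(String[] args) {
        System.out.println(partitionDir("s3://bucket/table", ""));
        System.out.println(partitionDir("s3://bucket/table", "2019/10/21"));
    }
}
```

The real fix lives where `HoodieAppendHandle.init` builds the log-file directory; the point is only that empty partition paths must be special-cased before a `Path` is constructed.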