[jira] [Updated] (HUDI-1205) Serialization fail when log file is larger than 2GB

2020-08-19 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1205:
-
Status: Open  (was: New)

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the log file 
> size in bytes, and a signed 32-bit integer can only represent sizes up to 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
> ... 31 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1205) Serialization fail when log file is larger than 2GB

2020-08-19 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1205:
-
Description: 
When scanning the log file, if the log file (or log file group) is larger than 
2GB, serialization will fail because Hudi uses an Integer to store the log file 
size in bytes, and a signed 32-bit integer can only represent sizes up to 2GB.

Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
Serialization trace:
orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
data (org.apache.hudi.common.model.HoodieRecord)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
at org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
... 31 more
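
As a quick illustration of why Integer-based size tracking breaks past 2GB (a minimal sketch, not Hudi's actual code; the file length is made up):
{code:scala}
// A log file (or file group) larger than Integer.MAX_VALUE (~2.1GB):
val fileLengthBytes: Long = 3L * 1024 * 1024 * 1024 // 3GB

// Storing that length in a 32-bit Int silently overflows to a negative number,
// so any buffer size or offset computed from it is garbage.
val sizeAsInt: Int = fileLengthBytes.toInt
println(sizeAsInt) // -1073741824
{code}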

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the log file 
> size in bytes, and a signed 32-bit integer can only represent sizes up to 2GB.

[jira] [Created] (HUDI-1205) Serialization fail when log file is larger than 2GB

2020-08-19 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1205:


 Summary: Serialization fail when log file is larger than 2GB
 Key: HUDI-1205
 URL: https://issues.apache.org/jira/browse/HUDI-1205
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-08-07 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173594#comment-17173594
 ] 

Yanjia Gary Li commented on HUDI-920:
-

The most challenging part of the incremental query for MOR is how to apply the 
filter based on commit time.

The current implementation simply uses the Spark dataframe API: 
[https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L122]

But at the RDD level, we can't use this approach anymore.

Based on my investigation, these are the approaches I can come up with (a 
sketch of the first one is below):
 * Use PrunedFilteredScan and push the extra filter down. For this approach, 
we will have to:

 ** Set
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
 ** Filter Avro records ourselves at the record-reading level.
 ** Trick Spark into scanning the whole file when the user's input won't 
trigger a file scan, e.g. *df.count()*. With the default ParquetReaderFunction, 
it will just read the metadata for the count and won't apply the filter.
 * Apply the filter before buildScan() returns.
 ** We will have to ser/deser the UnsafeRow to get the _hoodie_commit_time_.
 * Dig into Spark planning to see if it's possible to force the filter somewhere. 
 * Explore Data Source V2 to see if there is better support.

All we need is to force a filter before returning the result to the user, but I 
don't see this supported in Data Source V1. At this point, the first approach 
is the most reasonable one to me, and I can make a PR soon.
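
For the first approach, a minimal, hypothetical sketch (the class name and field handling are illustrative, not Hudi's actual implementation) of how a PrunedFilteredScan relation could pick up the pushed-down commit-time filter and apply it at the record level:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation: baseRdd would come from the merged MOR records.
class MorIncrementalRelation(override val sqlContext: SQLContext,
                             override val schema: StructType,
                             baseRdd: RDD[Row])
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Pick up the commit-time lower bound that Spark pushed down, if any.
    val beginTime = filters.collectFirst {
      case GreaterThan("_hoodie_commit_time", v: String) => v
    }
    val idx = schema.fieldIndex("_hoodie_commit_time")
    // Apply the filter ourselves at the record level (column pruning omitted).
    beginTime.fold(baseRdd)(t => baseRdd.filter(_.getString(idx) > t))
  }
}
{code}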

[~vinoth] [~bhasudha] [~uditme] WDYT?

> Incremental view on MOR table using Spark Datasource
> 
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-69) Support realtime view in Spark datasource #136

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-69.

Resolution: Fixed

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-69) Support realtime view in Spark datasource #136

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reopened HUDI-69:


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1051) Improve MOR datasource reader file listing and path handling

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-1051.
--
Resolution: Fixed

> Improve MOR datasource reader file listing and path handling
> 
>
> Key: HUDI-1051
> URL: https://issues.apache.org/jira/browse/HUDI-1051
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Status: Closed  (was: Patch Available)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-1052.
--
Resolution: Fixed

> Support vectorized reader for MOR datasource reader
> ---
>
> Key: HUDI-1052
> URL: https://issues.apache.org/jira/browse/HUDI-1052
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-1050.
--
Resolution: Fixed

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1052:
-
Status: In Progress  (was: Open)

> Support vectorized reader for MOR datasource reader
> ---
>
> Key: HUDI-1052
> URL: https://issues.apache.org/jira/browse/HUDI-1052
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1141) Serialization fail when loading two log files

2020-07-31 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1141:
-
Summary: Serialization fail when loading two log files  (was: Serialization 
fail when loading large log files)

> Serialization fail when loading two log files
> -
>
> Key: HUDI-1141
> URL: https://issues.apache.org/jira/browse/HUDI-1141
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>
> [https://github.com/apache/hudi/issues/1890]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1141) Serialization fail when loading large log files

2020-07-31 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1141:


 Summary: Serialization fail when loading large log files
 Key: HUDI-1141
 URL: https://issues.apache.org/jira/browse/HUDI-1141
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


[https://github.com/apache/hudi/issues/1890]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-07-26 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1050:
-
Status: In Progress  (was: Open)

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2020-07-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1120:
-
Component/s: Code Cleanup

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2020-07-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1120:
-
Fix Version/s: 0.6.0

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2020-07-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1120:
-
Status: In Progress  (was: Open)

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2020-07-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1120:
-
Status: Open  (was: New)

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1120) Support spotless for scala

2020-07-22 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1120:


 Summary: Support spotless for scala
 Key: HUDI-1120
 URL: https://issues.apache.org/jira/browse/HUDI-1120
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-07-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1050:
-
Fix Version/s: (was: 0.6.1)
   0.6.0

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1114) Explore Spark Structure Streaming for Hudi Dataset

2020-07-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1114:
-
Status: Open  (was: New)

> Explore Spark Structure Streaming for Hudi Dataset
> --
>
> Key: HUDI-1114
> URL: https://issues.apache.org/jira/browse/HUDI-1114
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> [https://github.com/apache/hudi/issues/1839]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1114) Explore Spark Structure Streaming for Hudi Dataset

2020-07-20 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1114:


 Summary: Explore Spark Structure Streaming for Hudi Dataset
 Key: HUDI-1114
 URL: https://issues.apache.org/jira/browse/HUDI-1114
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Yanjia Gary Li


[https://github.com/apache/hudi/issues/1839]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1101) Decouple Hive dependencies from hudi-spark and hudi-utilities

2020-07-16 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1101:


 Summary: Decouple Hive dependencies from hudi-spark and 
hudi-utilities
 Key: HUDI-1101
 URL: https://issues.apache.org/jira/browse/HUDI-1101
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li


We have the Hive sync tool in both the hudi-spark and hudi-utilities modules. This 
might cause dependency conflicts when the user doesn't use Hive at all. We could 
move all the Hive sync related methods to the hudi-hive-sync module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1051) Improve MOR datasource reader file listing and path handling

2020-06-24 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1051:
-
Status: Open  (was: New)

> Improve MOR datasource reader file listing and path handling
> 
>
> Key: HUDI-1051
> URL: https://issues.apache.org/jira/browse/HUDI-1051
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-06-24 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1052:
-
Status: Open  (was: New)

> Support vectorized reader for MOR datasource reader
> ---
>
> Key: HUDI-1052
> URL: https://issues.apache.org/jira/browse/HUDI-1052
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-06-24 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1050:
-
Status: Open  (was: New)

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-06-24 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1052:


 Summary: Support vectorized reader for MOR datasource reader
 Key: HUDI-1052
 URL: https://issues.apache.org/jira/browse/HUDI-1052
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1051) Improve MOR datasource reader file listing and path handling

2020-06-24 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1051:


 Summary: Improve MOR datasource reader file listing and path 
handling
 Key: HUDI-1051
 URL: https://issues.apache.org/jira/browse/HUDI-1051
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-06-24 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1050:


 Summary: Support filter pushdown and column pruning for MOR table 
on Spark Datasource
 Key: HUDI-1050
 URL: https://issues.apache.org/jira/browse/HUDI-1050
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


We need to use the information provided by PrunedFilteredScan to push down the 
filter and column projection. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1028) Hudi write job stuck when start EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1028:
-
Description: 
With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
deployed a Hudi write job on Hadoop3 and Spark2.4 in *client* mode, but the job 
was stuck after failed to start the timeline service. The driver will stay 
alive until we manually kill the job.

Also, including the javalin.io package doesn't solve this ClassNotFound 
exception. With this exception, Hudi should terminate the job or fall back to 
not using the timeline service.

  was:
With "hoodie.embed.timeline.server" set to "true" by default in 0.5.3, I 
deployed a Hudi write job on Hadoop 3 and Spark 2.4 in client mode, but the job 
was stuck after failing to start the timeline service. The driver stays 
alive until we manually kill the job.

Also, including the javalin.io package doesn't solve this ClassNotFoundException. 
With this exception, Hudi should terminate the job or fall back to 
not using the timeline service.


> Hudi write job stuck when start EmbeddedTimelineService failed
> --
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: stack_trace.txt
>
>
> With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
> deployed a Hudi write job on Hadoop3 and Spark2.4 in *client* mode, but the 
> job was stuck after failed to start the timeline service. The driver will 
> stay alive until we manually kill the job.
> Also, including the javalin.io package doesn't solve this ClassNotFound 
> exception. With this exception, Hudi should terminate the job or fall back to 
> not using the timeline service.
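
A hedged sketch of the suggested fallback behavior (names and structure are illustrative, not Hudi's actual code):
{code:scala}
// Try to start the embedded timeline service; on failure, log and fall back
// instead of leaving the driver hanging until someone kills the job.
def startTimelineServiceSafely(start: () => Unit): Boolean =
  try {
    start()
    true                                  // timeline server is up
  } catch {
    case e: Exception =>
      System.err.println(s"Embedded timeline service failed to start, " +
        s"falling back to direct timeline access: $e")
      false                               // behave as hoodie.embed.timeline.server=false
  }
{code}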



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1028) Hudi write job stuck when start EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1028:
-
Summary: Hudi write job stuck when start EmbeddedTimelineService failed  
(was: Hudi write job stuck when create EmbeddedTimelineService failed)

> Hudi write job stuck when start EmbeddedTimelineService failed
> --
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: stack_trace.txt
>
>
> With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
> deployed a Hudi write job on Hadoop3 and Spark2.4 in client mode, but the job 
> was stuck after failed to start the timeline service. The driver will stay 
> alive until we manually kill the job.
> Also, including the javalin.io package doesn't solve this ClassNotFound 
> exception. With this exception, Hudi should terminate the job or fall back to 
> not using the timeline service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1028) Hudi write job stuck when start EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138910#comment-17138910
 ] 

Yanjia Gary Li commented on HUDI-1028:
--

Hi [~xleesf], have you seen anything similar happen in your local build? I 
remember you mentioned something related to the EmbeddedTimelineService during 
the release vote.

> Hudi write job stuck when start EmbeddedTimelineService failed
> --
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: stack_trace.txt
>
>
> With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
> deployed a Hudi write job on Hadoop3 and Spark2.4 in client mode, but the job 
> was stuck after failed to start the timeline service. The driver will stay 
> alive until we manually kill the job.
> Also, including the javalin.io package doesn't solve this ClassNotFound 
> exception. With this exception, Hudi should terminate the job or fall back to 
> not using the timeline service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1028) Hudi write job stuck when create EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1028:
-
Description: 
With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
deployed a Hudi write job on Hadoop3 and Spark2.4 in client mode, but the job 
was stuck after failed to start the timeline service. The driver will stay 
alive until we manually kill the job.

Also, including the javalin.io package doesn't solve this ClassNotFound 
exception. With this exception, Hudi should terminate the job or fall back to 
not using the timeline service.

> Hudi write job stuck when create EmbeddedTimelineService failed
> ---
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: stack_trace.txt
>
>
> With "hoodie.embed.timeline.server" set to "true" as default in 0.5.3, I 
> deployed a Hudi write job on Hadoop3 and Spark2.4 in client mode, but the job 
> was stuck after failed to start the timeline service. The driver will stay 
> alive until we manually kill the job.
> Also, including the javalin.io package doesn't solve this ClassNotFound 
> exception. With this exception, Hudi should terminate the job or fall back to 
> not using the timeline service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1028) Hudi write job stuck when create EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1028:
-
Attachment: stack_trace.txt

> Hudi write job stuck when create EmbeddedTimelineService failed
> ---
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: stack_trace.txt
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1028) Hudi write job stuck when create EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1028:
-
Status: Open  (was: New)

> Hudi write job stuck when create EmbeddedTimelineService failed
> ---
>
> Key: HUDI-1028
> URL: https://issues.apache.org/jira/browse/HUDI-1028
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1028) Hudi write job stuck when create EmbeddedTimelineService failed

2020-06-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1028:


 Summary: Hudi write job stuck when create EmbeddedTimelineService 
failed
 Key: HUDI-1028
 URL: https://issues.apache.org/jira/browse/HUDI-1028
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-12 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134533#comment-17134533
 ] 

Yanjia Gary Li commented on HUDI-1018:
--

[~Litianye], since we are solving this ticket together with 
[https://github.com/apache/hudi/pull/1719], please close this once it is merged.

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Tianye Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
> and in different 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
> We should clean this up and put all the logic in one place.
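
A hedged sketch of what the consolidation could look like (the helper name and placement are made up for illustration):
{code:scala}
// One place that decides what an "empty" checkpoint means, so DeltaSync and
// every Source stop doing their own ad-hoc empty-string checks.
object CheckpointUtils {
  def resolve(raw: Option[String]): Option[String] =
    raw.map(_.trim).filter(_.nonEmpty)
}

// CheckpointUtils.resolve(Some(""))   == None  -> treat as a fresh start
// CheckpointUtils.resolve(Some("42")) == Some("42")
{code}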



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-1018:


Assignee: Tianye Li

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Tianye Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
> and in different 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
> We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1018:
-
Component/s: DeltaStreamer

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
> and in different 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
> We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1018:
-
Status: Open  (was: New)

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
> and in different 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
> We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1018:


 Summary: Handle empty checkpoint better in delta streamer
 Key: HUDI-1018
 URL: https://issues.apache.org/jira/browse/HUDI-1018
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li


Right now we are handling the empty-string checkpoint in 
DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
and in different 
sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).

We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-905.
---
Resolution: Not A Problem

TableScan already supports filter and projection pushdown.

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter.
> If we want to use Spark predicate pushdown in a native way, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-610) MOR table Impala read support

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-610:

Summary: MOR table Impala read support  (was: Impala near real time table 
support)

> MOR table Impala read support
> -
>
> Key: HUDI-610
> URL: https://issues.apache.org/jira/browse/HUDI-610
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Impala uses the Java-based module called "frontend" to list all the files to 
> scan and lets the C++-based "backend" do all the file scanning. 
> Merging Avro and Parquet could be difficult because it might need custom 
> merging logic like RealtimeCompactedRecordReader to be implemented in the 
> backend using C++, but I think it will be doable to have something like 
> RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-610) Impala near real time table support

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-610:
---

Assignee: (was: Yanjia Gary Li)

> Impala near real time table support
> --
>
> Key: HUDI-610
> URL: https://issues.apache.org/jira/browse/HUDI-610
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Impala uses the Java-based module called "frontend" to list all the files to 
> scan and lets the C++-based "backend" do all the file scanning. 
> Merging Avro and Parquet could be difficult because it might need custom 
> merging logic like RealtimeCompactedRecordReader to be implemented in the 
> backend using C++, but I think it will be doable to have something like 
> RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-494.
-
Resolution: Fixed

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB of input it was 3.7 million, both with a parallelism 
> of 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-494.
---

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB of input it was 3.7 million, both with a parallelism 
> of 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-822.
-
Resolution: Fixed

> Decouple hoodie related methods with Hoodie Input Formats
> -
>
> Key: HUDI-822
> URL: https://issues.apache.org/jira/browse/HUDI-822
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>
> In order to support multiple query engines, we need to generalize the Hudi 
> input format and Hudi record merging logic, and decouple them from 
> MapredParquetInputFormat, which depends on Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-822.
---

> Decouple hoodie related methods with Hoodie Input Formats
> -
>
> Key: HUDI-822
> URL: https://issues.apache.org/jira/browse/HUDI-822
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>
> In order to support multiple query engines, we need to generalize the Hudi 
> input format and Hudi record merging logic, and decouple them from 
> MapredParquetInputFormat, which depends on Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Status: Open  (was: New)

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Component/s: Testing

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Status: Open  (was: New)

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> hudi-client unit tests have a memory leak, which could be caused by resources 
> not being properly released during cleanup. The memory consumption was 
> accumulating over time and led to Travis CI failures. 
> By using the IntelliJ memory analysis tool, we found the major leaks were in 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PRs: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Component/s: Testing

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> hudi-client unit tests have a memory leak, which could be caused by resources 
> not being properly released during cleanup. The memory consumption was 
> accumulating over time and led to Travis CI failures. 
> By using the IntelliJ memory analysis tool, we found the major leaks were in 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PRs: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Labels: help-wanted  (was: )

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1011:


 Summary: Refactor hudi-client unit tests structure
 Key: HUDI-1011
 URL: https://issues.apache.org/jira/browse/HUDI-1011
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li


hudi-client unit tests are the most time-consuming test module and are not 
stable. We initialize and clean up resources for every single unit test, which 
is inefficient. We can refactor the hudi-client test structure to run more 
tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Description: 
hudi-client unit test has a memory leak, which could be some resources are not 
properly released during the cleanup. The memory consumption was accumulating 
over time and lead to the Travis CI failure. 

By using the IntelliJ memory analysis tool, we can find the major leak was 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, e.t.c

Related PR: [https://github.com/apache/hudi/pull/1707]

[https://github.com/apache/hudi/pull/1697]

!image-2020-06-08-09-22-08-864.png!

  was:
hudi-client unit tests have a memory leak, which could be caused by resources 
not being properly released during cleanup. The memory consumption was 
accumulating over time and led to Travis CI failures. 

By using the IntelliJ memory analysis tool, we found the major leaks were in 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

!image-2020-06-08-09-22-08-864.png!


> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulated 
> over time and led to the Travis CI failure. 
> Using the IntelliJ memory analysis tool, we found the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Description: 
The hudi-client unit tests have a memory leak, likely because some resources 
are not properly released during cleanup. The memory consumption accumulated 
over time and led to the Travis CI failure. 

Using the IntelliJ memory analysis tool, we found the major leaks were 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

!image-2020-06-08-09-22-08-864.png!

  was:
hudi-client unit test has a memory leak, which could be some resources are not 
released during the cleanup. The memory consumption was accumulating overtime 
and lead the Travis CI failure. 

By using the IntelliJ memory analysis tool, we can find the major leak was 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, e.t.c

!image-2020-06-08-09-22-08-864.png!


> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulated 
> over time and led to the Travis CI failure. 
> Using the IntelliJ memory analysis tool, we found the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Labels: help-wanted  (was: )

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulated 
> over time and led to the Travis CI failure. 
> Using the IntelliJ memory analysis tool, we found the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1010:


 Summary: Fix the memory leak for hudi-client unit tests
 Key: HUDI-1010
 URL: https://issues.apache.org/jira/browse/HUDI-1010
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Yanjia Gary Li
 Attachments: image-2020-06-08-09-22-08-864.png

The hudi-client unit tests have a memory leak, likely because some resources 
are not properly released during cleanup. The memory consumption accumulated 
over time and led to the Travis CI failure. 

Using the IntelliJ memory analysis tool, we found the major leaks were 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

!image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-773.
-
Resolution: Fixed

Azure info was added to the docs.

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-773.
---

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-804) Add Azure Support to Hudi Doc

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-804.
-
Resolution: Fixed

> Add Azure Support to Hudi Doc
> -
>
> Key: HUDI-804
> URL: https://issues.apache.org/jira/browse/HUDI-804
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-804) Add Azure Support to Hudi Doc

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-804.
---

> Add Azure Support to Hudi Doc
> -
>
> Key: HUDI-804
> URL: https://issues.apache.org/jira/browse/HUDI-804
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-805) Verify which types of Azure storage support Hudi

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-805.
-
Resolution: Fixed

Azure Data Lake Storage Gen 2 and Azure Blob Storage support Hudi.
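
For reference, a minimal write sketch against ADLS Gen 2 over the ABFS 
connector; the account, container, key, and the spark/df values in scope are 
placeholders, not from this ticket:

{code:scala}
// Sketch: Hudi write to ADLS Gen 2 via hadoop-azure's ABFS connector.
// Account/container names and the access key are placeholders.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.myaccount.dfs.core.windows.net", "<access-key>")

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .mode("append")
  .save("abfss://mycontainer@myaccount.dfs.core.windows.net/hudi/my_table")
{code}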

> Verify which types of Azure storage support Hudi
> 
>
> Key: HUDI-805
> URL: https://issues.apache.org/jira/browse/HUDI-805
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Azure has the following storage options:
> Azure Data Lake Storage Gen 1
> Azure Data Lake Storage Gen 2
> Azure Blob Storage (legacy name: Windows Azure Storage Blob)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-805) Verify which types of Azure storage support Hudi

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-805.
---

> Verify which types of Azure storage support Hudi
> 
>
> Key: HUDI-805
> URL: https://issues.apache.org/jira/browse/HUDI-805
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Azure has the following storage options:
> Azure Data Lake Storage Gen 1
> Azure Data Lake Storage Gen 2
> Azure Blob Storage (legacy name: Windows Azure Storage Blob)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-805) Verify which types of Azure storage support Hudi

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-805:

Status: Open  (was: New)

> Verify which types of Azure storage support Hudi
> 
>
> Key: HUDI-805
> URL: https://issues.apache.org/jira/browse/HUDI-805
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Azure has the following storage options:
> Azure Data Lake Storage Gen 1
> Azure Data Lake Storage Gen 2
> Azure Blob Storage (legacy name: Windows Azure Storage Blob)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-805) Verify which types of Azure storage support Hudi

2020-05-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-805:

Status: In Progress  (was: Open)

> Verify which types of Azure storage support Hudi
> 
>
> Key: HUDI-805
> URL: https://issues.apache.org/jira/browse/HUDI-805
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Azure has the following storage options:
> Azure Data Lake Storage Gen 1
> Azure Data Lake Storage Gen 2
> Azure Blob Storage (legacy name: Windows Azure Storage Blob)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-804) Add Azure Support to Hudi Doc

2020-05-25 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-804:

Status: In Progress  (was: Open)

> Add Azure Support to Hudi Doc
> -
>
> Key: HUDI-804
> URL: https://issues.apache.org/jira/browse/HUDI-804
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-804) Add Azure Support to Hudi Doc

2020-05-25 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-804:

Status: Open  (was: New)

> Add Azure Support to Hudi Doc
> -
>
> Key: HUDI-804
> URL: https://issues.apache.org/jira/browse/HUDI-804
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-23 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114980#comment-17114980
 ] 

Yanjia Gary Li commented on HUDI-110:
-

[~shivnarayan] no, the PR is not related to this ticket.

This ticket is more about exploring a new feature and will take some time. We 
can remove the bug-bash tag.

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default partition 
> extractor. This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated with the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.
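
To make the gap concrete, a sketch contrasting the two write paths; the paths 
and field names are illustrative, and the option keys are the commonly 
documented ones, not anything decided on this ticket:

{code:scala}
// Plain Spark datasource: the writer itself declares the partitioning.
df.write.format("parquet")
  .partitionBy("year", "month")
  .save("/tmp/plain_table")

// Hudi today: partitioning is driven by Hudi's own options, with the
// (Uber-centric) SlashEncodedDayPartitionValueExtractor only as a default
// elsewhere; partitionBy on the writer is not honored.
df.write.format("hudi")
  .option("hoodie.table.name", "hudi_table")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .save("/tmp/hudi_table")
{code}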



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-23 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-110:

Labels:   (was: bug-bash-0.6.0 pull-request-available)

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default partition 
> extractor. This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated with the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-23 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-494:
---

Assignee: Yanjia Gary Li  (was: lamber-ken)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-23 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114967#comment-17114967
 ] 

Yanjia Gary Li commented on HUDI-494:
-

[~shivnarayan] this is still under review.

[https://github.com/apache/hudi/pull/1602]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-05-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-920:

Fix Version/s: 0.6.0

> Incremental view on MOR table using Spark Datasource
> 
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-05-22 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-920:

Status: Open  (was: New)

> Incremental view on MOR table using Spark Datasource
> 
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-05-22 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-920:
---

 Summary: Incremental view on MOR table using Spark Datasource
 Key: HUDI-920
 URL: https://issues.apache.org/jira/browse/HUDI-920
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Spark Integration
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li
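
For context, the read path this feature targets looks roughly like this; a 
sketch with option keys spelled as plain strings and an illustrative begin 
instant (exact constants may differ by release):

{code:scala}
// Sketch of an incremental read on a MOR table through the datasource.
// Option keys spelled as plain strings; path and instant are illustrative.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200522000000")
  .load("/data/hudi_mor_table")
{code}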






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-905:
---

Assignee: Yanjia Gary Li

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to implement 
> PrunedFilteredScan for the Hudi Datasource.
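
A minimal sketch of what implementing the trait entails, using Spark's stable 
sources API; the class body is illustrative, not Hudi's actual relation:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Sketch only: Spark calls buildScan with the projected columns and the
// filters it could push down, instead of Hudi parsing them from options.
class HudiPrunedFilteredRelation(val sqlContext: SQLContext,
                                 tableSchema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = tableSchema

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Here Hudi would translate `filters` into commit-time/parquet
    // predicates and read only `requiredColumns`.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}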



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Status: Open  (was: New)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to implement 
> PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Component/s: Spark Integration

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to implement 
> PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Priority: Minor  (was: Major)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to implement 
> PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Description: 
The Hudi Spark Datasource incremental view currently uses 
DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.

If we want to use Spark predicate pushdown natively, we need to implement 
PrunedFilteredScan for the Hudi Datasource.

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to implement 
> PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Summary: Support PrunedFilteredScan for Spark Datasource  (was: Support 
native filter pushdown for Spark Datasource)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-905:
---

Assignee: (was: Yanjia Gary Li)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-30) Explore support for Spark Datasource V2

2020-05-19 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-30:
--

Assignee: (was: Yanjia Gary Li)

> Explore support for Spark Datasource V2
> ---
>
> Key: HUDI-30
> URL: https://issues.apache.org/jira/browse/HUDI-30
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Priority: Major
>
> https://github.com/uber/hudi/issues/501



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-19 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111795#comment-17111795
 ] 

Yanjia Gary Li commented on HUDI-110:
-

IIUC, this ticket is trying to extract the partition info from the folder 
structure when querying through Spark. Please let me know if I am wrong. 

I made a PR with an example. This feature is actually supported already. 
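
For the record, a sketch of the already-working pattern the comment refers to 
(paths are illustrative): one glob per partition level lets Spark recover the 
partition columns from the hive-style folder structure.

{code:scala}
// Three partition levels, e.g. year=*/month=*/day=*; Spark infers the
// partition columns from the hive-style folder names under the base path.
val df = spark.read
  .format("hudi")
  .load("/data/hudi_table/*/*/*")
{code}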

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default partition 
> extractor. This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated with the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-905) Support native filter pushdown for Spark Datasource

2020-05-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-905:
---

 Summary: Support native filter pushdown for Spark Datasource
 Key: HUDI-905
 URL: https://issues.apache.org/jira/browse/HUDI-905
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109805#comment-17109805
 ] 

Yanjia Gary Li commented on HUDI-890:
-

Hi [~bhavanisudha], #1602 (HUDI-494: fix incorrect record size estimation) was 
pushed to 0.6.0. Thanks

> Prepare for 0.5.3 patch release
> ---
>
> Key: HUDI-890
> URL: https://issues.apache.org/jira/browse/HUDI-890
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.3
>
>
> The following commits are included in this release.
>  * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
> the inheritance chain
>  * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
>  * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
> NumericUtils.java
>  * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
> filterDupes is enabled on UPSERT mode.
>  * #1517 HUDI-799 Use appropriate FS when loading configs
>  * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
> #1394 HUDI-656 [Performance] Return a dummy Spark relation after writing 
> the DataFrame
>  * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
>  * #1421 HUDI-724 Parallelize getSmallFiles for partitions
>  * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
> Date type columns
>  * #1413 Add constructor to HoodieROTablePathFilter
>  * #1415 HUDI-539 Make ROPathFilter conf member serializable
>  * #1578 Add changes for presto mor queries
>  * #1506 HUDI-782 Add support of Aliyun object storage service.
>  * #1432 HUDI-716 Exception: Not an Avro data file when running 
> HoodieCleanClient.runClean
>  * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
>  * #1448 [MINOR] Update DOAP with 0.5.2 Release
>  * #1466 HUDI-742 Fix Java Math Exception
>  * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
>  * #1427 HUDI-727: Copy default values of fields if not present when 
> rewriting incoming record with new schema
>  * #1515 HUDI-795 Handle auto-deleted empty aux folder
>  * #1547 [MINOR]: Fix cli docs for DeltaStreamer
>  * #1580 HUDI-852 adding check for table name for Append Save mode
>  * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
> HoodieGlobalBloomIndex class
>  * #1434 HUDI-616 Fixed parquet files getting created on local FS
>  * #1633 HUDI-858 Allow multiple operations to be executed within a single 
> commit
> #1634 HUDI-846 Enable Incremental cleaning and embedded timeline-server by 
> default
>  * #1596 HUDI-863 get decimal properties from derived spark DataType
>  * #1602 HUDI-494 fix incorrect record size estimation
>  * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using 
> timeline server
>  * #1584 HUDI-902 Avoid exception when getSchemaProvider
>  * #1612 HUDI-528 Handle empty commit in incremental pulling
> #1511 HUDI-789 Adjust logic of upsert in HDFSParquetImporter
>  * #1627 HUDI-889 Writer supports useJdbc configuration when hive 
> synchronization is enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Fix Version/s: (was: 0.5.3)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-110:

Status: In Progress  (was: Open)

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default partition 
> extractor. This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated with the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-15 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108528#comment-17108528
 ] 

Yanjia Gary Li commented on HUDI-110:
-

Hi [~bhasudha], I can pick up this ticket if no one is working on it right 
now.

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default partition 
> extractor. This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated with the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-15 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-528.
-
Resolution: Fixed

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field used to determine file 
> paths (partitionToWriteStats) will be empty, causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}
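
A sketch of the kind of guard the fix needs, not the merged change: 
commitsToReturn and basePath mirror names in IncrementalRelation, and 
metadataFor is a hypothetical helper returning the parsed 
HoodieCommitMetadata.

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hudi.exception.HoodieException

// Walk the requested commits newest-first and skip any whose write stats
// are empty before picking a file to infer the table schema from.
// metadataFor is hypothetical; commitsToReturn/basePath come from the relation.
val schemaFilePath: String = commitsToReturn.reverse
  .map(metadataFor)
  .flatMap(_.getFileIdAndFullPaths(basePath).values().asScala.headOption)
  .headOption
  .getOrElse(throw new HoodieException(
    "No data files found in the requested incremental range"))
{code}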



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-13 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106550#comment-17106550
 ] 

Yanjia Gary Li commented on HUDI-890:
-

Hi [~bhavanisudha], I have two bug-fix tickets that might be a fit for the 
0.5.3 release. 

https://jira.apache.org/jira/browse/HUDI-494

https://jira.apache.org/jira/browse/HUDI-528

I believe these two will likely be merged this week. 

> Prepare for 0.5.3 patch release
> ---
>
> Key: HUDI-890
> URL: https://issues.apache.org/jira/browse/HUDI-890
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.3
>
>
> The following commits are included in this release.
>  * #1372 [HUDI-652] Decouple HoodieReadClient and AbstractHoodieClient to 
> break the inheritance chain
>  * #1388 [HUDI-681] Remove embeddedTimelineService from HoodieReadClient
>  * #1350 [HUDI-629]: Replace Guava's Hashing with an equivalent in 
> NumericUtils.java
>  * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
> filterDupes is enabled on UPSERT mode.
>  * #1517 [HUDI-799] Use appropriate FS when loading configs
>  * #1406 [HUDI-713] Fix conversion of Spark array of struct type to Avro 
> schema
>  * #1394 [HUDI-656][Performance] Return a dummy Spark relation after writing 
> the DataFrame
>  * #1576 [HUDI-850] Avoid unnecessary listings in incremental cleaning mode
>  * #1421 [HUDI-724] Parallelize getSmallFiles for partitions
>  * #1330 [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned 
> by Date type columns
>  * #1413 Add constructor to HoodieROTablePathFilter
>  * #1415 [HUDI-539] Make ROPathFilter conf member serializable
>  * #1578 Add changes for presto mor queries
>  * #1506 [HUDI-782] Add support of Aliyun object storage service.
>  * #1432 [HUDI-716] Exception: Not an Avro data file when running 
> HoodieCleanClient.runClean
>  * #1422 [HUDI-400] Check upgrade from old plan to new plan for compaction
>  * #1448 [MINOR] Update DOAP with 0.5.2 Release
>  * #1466 [HUDI-742] Fix Java Math Exception
>  * #1416 [HUDI-717] Fixed usage of HiveDriver for DDL statements.
>  * #1427 [HUDI-727]: Copy default values of fields if not present when 
> rewriting incoming record with new schema
>  * #1515 [HUDI-795] Handle auto-deleted empty aux folder
>  * #1547 [MINOR]: Fix cli docs for DeltaStreamer
>  * #1580 [HUDI-852] adding check for table name for Append Save mode
>  * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
> HoodieGlobalBloomIndex class
>  * #1434 [HUDI-616] Fixed parquet files getting created on local FS



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-318) Update Migration Guide to Include Delta Streamer

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-318:
---

Assignee: (was: Yanjia Gary Li)

> Update Migration Guide to Include Delta Streamer
> 
>
> Key: HUDI-318
> URL: https://issues.apache.org/jira/browse/HUDI-318
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Yanjia Gary Li
>Priority: Minor
>  Labels: doc
>
> [http://hudi.apache.org/migration_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Fix Version/s: 0.5.3

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.5.3
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-528:

Fix Version/s: 0.5.3

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field used to determine file 
> paths (partitionToWriteStats) will be empty, causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-528:

Status: In Progress  (was: Open)

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field used to determine file 
> paths (partitionToWriteStats) will be empty, causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-528:
---

Assignee: Yanjia Gary Li

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field used to determine file 
> paths (partitionToWriteStats) will be empty, causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Status: In Progress  (was: Open)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-07 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101055#comment-17101055
 ] 

Yanjia Gary Li edited comment on HUDI-494 at 5/8/20, 1:38 AM:
--

-Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]-

So basically commit 1 wrote a very small file (say, 200 records) to a new 
partition day=05. Then, when commit 2 tried to write, it looked back at 
commit 1 to estimate the size of each record, but because commit 1 had too 
few records the estimate was inaccurate and far too large. Hudi then computed 
records-per-file from that inflated record size and arrived at a very small 
records-per-file value. This led to many small files. 


was (Author: garyli1019):
Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically commit 1 wrote a very small file(let's say 200 records) to a new 
partition day=05. And then when commit 2 trying to write to day=05, it will 
look up the affected partition and use the Bloom index range from the existing 
files, so it will use 200 here. Commit 2 has much more records than 200, so it 
will create tons of files since the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know 
if I understand this correctly and we can figure out a fix. [~lamber-ken] 
[~vinoth]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB it was 3.7 million, both with 10 parallelisms. 
> I am seeing a huge amount of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101207#comment-17101207
 ] 

Yanjia Gary Li commented on HUDI-494:
-

 

Commit 1:
{code:java}

"partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-112-1773_20200504101048.parquet",
  "prevCommit" : "null",
  "numWrites" : 21,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 21,
  "totalWriteBytes" : 14397559,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14397559
}
{code}
Commit2:
{code:java}
  "partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-248-163129_20200505023830.parquet",
  "prevCommit" : "20200504101048",
  "numWrites" : 12817,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 12796,
  "totalWriteBytes" : 16297335,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 16297335
}, {
  "fileId" : "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/9d0c9e79-00dd-41d2-a217-0944f8428e1c-0_1-248-163130_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 200,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 200,
  "totalWriteBytes" : 14428883,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428883
}, {
  "fileId" : "5990beb4-bd0c-40c9-84f1-a4107287971e-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/5990beb4-bd0c-40c9-84f1-a4107287971e-0_2-248-163131_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 198,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 198,
  "totalWriteBytes" : 14428338,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428338
}, {
  "fileId" : "673c5550-39c3-4611-ac68-bc0c7da065e2-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/673c5550-39c3-4611-ac68-bc0c7da065e2-0_3-248-163132_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 179,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 179,
  "totalWriteBytes" : 14425571,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14425571
}
{code}
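
Reading the two sets of stats together: commit 1's file implies an average 
record size of 14397559 / 21 ≈ 686 KB, while commit 2's new files come out at 
roughly 14428883 / 200 ≈ 72 KB per record. A quick sanity check of that gap 
(plain arithmetic over the numbers above, nothing Hudi-specific; the 120 MB 
target file size is an assumption on my part):
{code:java}
// Plain arithmetic over the write stats above -- no Hudi APIs involved.
long commit1Bytes = 14_397_559L, commit1Records = 21L;
long commit2Bytes = 14_428_883L, commit2Records = 200L;

long estimatedRecordSize = commit1Bytes / commit1Records; // ~685,598 bytes
long actualRecordSize = commit2Bytes / commit2Records;    // ~72,144 bytes

// If the writer budgets ~120 MB per file using the inflated estimate, it
// caps each new file at roughly 183 records -- which lines up with the
// 179-200 record files written by commit 2.
long recordsPerFile = (120L * 1024 * 1024) / estimatedRecordSize; // ~183
{code}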
 

 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.

[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101055#comment-17101055
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Ok, I see what happened here. The root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically, commit 1 wrote a very small file (let's say 200 records) to a new 
partition, day=05. Then, when commit 2 tries to write to day=05, it looks up the 
affected partition and uses the Bloom index range from the existing files, so it 
will use 200 here. Commit 2 has far more than 200 records, so it creates tons of 
files, since the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know 
if I understand this correctly so we can figure out a fix. [~lamber-ken] 
[~vinoth]
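
For anyone following along, here is a minimal, hypothetical sketch of the 
failure mode described above (illustrative only -- the variable names, the 
120 MB target file size, and the insert count are my assumptions, not the 
actual code at the line linked):
{code:java}
// Hypothetical sketch: a per-file record budget derived from one tiny
// existing file can explode the number of planned files (and Spark tasks).
long maxFileSizeBytes = 120L * 1024 * 1024;      // assumed target file size

// Commit 1 left 21 records in ~14.4 MB => ~686 KB "per record" estimate.
long avgRecordSizeEstimate = 14_397_559L / 21L;

// Records allowed per new file under that inflated estimate.
long recordsPerFile = maxFileSizeBytes / avgRecordSizeEstimate; // ~183

// With a few million incoming inserts (assumed figure), the writer plans
// one file -- and hence one task -- per ~183 records.
long incomingInserts = 3_000_000L;
long plannedFiles = (incomingInserts + recordsPerFile - 1) / recordsPerFile;
System.out.println(plannedFiles); // ~16,400 small files instead of a handful
{code}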

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems to be related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB of input it was 3.7 million, both with a parallelism of 10. 
> I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

