[jira] [Closed] (HUDI-590) Cut a new Doc version 0.5.1 explicitly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-590.
--

> Cut a new Doc version 0.5.1 explicitly
> --
>
> Key: HUDI-590
> URL: https://issues.apache.org/jira/browse/HUDI-590
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The latest version of the docs needs to be tagged as 0.5.1 explicitly on 
> the site. Follow the instructions in 
> [https://github.com/apache/incubator-hudi/blob/asf-site/README.md#updating-site]
>  to create a new directory 0.5.1 under docs/_docs/ 
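
The versioning step above can be sketched as follows. This is only an illustration against a throwaway directory; the file names and layout here are assumptions, and the linked README remains the authoritative procedure.

```shell
set -e
# Stand-in for an asf-site checkout (assumed layout, for illustration only).
SITE=$(mktemp -d)
mkdir -p "$SITE/docs/_docs"
echo "current docs" > "$SITE/docs/_docs/quick-start-guide.md"

# Cut the 0.5.1 doc version: snapshot the current pages into a
# versioned directory under docs/_docs/.
mkdir -p "$SITE/docs/_docs/0.5.1"
cp "$SITE/docs/_docs/"*.md "$SITE/docs/_docs/0.5.1/"
ls "$SITE/docs/_docs/0.5.1"
```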



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-624) Split some of the code from PR for HUDI-479

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-624:


> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR #1159 for HUDI-479, 
> making it easier to review.





[jira] [Resolved] (HUDI-624) Split some of the code from PR for HUDI-479

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-624.

Resolution: Fixed

> Split some of the code from PR for HUDI-479 
> 
>
> Key: HUDI-624
> URL: https://issues.apache.org/jira/browse/HUDI-624
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: patch, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR #1159 for HUDI-479, 
> making it easier to review.





[jira] [Closed] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-684.
--

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log file formats: 
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should handle all of these 
> combinations. 





[jira] [Reopened] (HUDI-728) Support for complex record keys with TimestampBasedKeyGenerator

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-728:


> Support for complex record keys with TimestampBasedKeyGenerator
> ---
>
> Key: HUDI-728
> URL: https://issues.apache.org/jira/browse/HUDI-728
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have TimestampBasedKeyGenerator for defining custom partition paths, and 
> we have ComplexKeyGenerator for using a combination of fields as the 
> record key or partition key. 
>  
> However, there is no support for the case where one wants a combination of 
> fields as the record key together with the ability to define custom 
> partition paths. This use case recently came up at my organisation. 
>  
> How about adding a CustomTimestampBasedKeyGenerator that supports this use 
> case? The class can simply extend TimestampBasedKeyGenerator and allow 
> users to use a combination of fields as the record key.
>  
> We will try to keep the implementation as generic as possible. 
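
A rough sketch of the proposed behavior, with hypothetical names (this is not Hudi's actual KeyGenerator API): the record key combines several fields in the ComplexKeyGenerator style of `field:value` pairs, while the partition path is derived from a timestamp.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: combines a ComplexKeyGenerator-style record
// key (field:value pairs joined by commas) with a
// TimestampBasedKeyGenerator-style partition path. Class and method
// names are hypothetical, not Hudi's actual API.
public class CustomTimestampKeySketch {

    // Record key from a combination of fields, in declaration order.
    static String recordKey(Map<String, String> record, List<String> keyFields) {
        return keyFields.stream()
                .map(f -> f + ":" + record.get(f))
                .collect(Collectors.joining(","));
    }

    // Partition path derived from an epoch-millis timestamp.
    static String partitionPath(long epochMillis) {
        return DateTimeFormatter.ofPattern("yyyy/MM/dd")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        Map<String, String> rec = Map.of("id", "42", "region", "us");
        System.out.println(recordKey(rec, List.of("id", "region"))); // id:42,region:us
        System.out.println(partitionPath(0L));                       // 1970/01/01
    }
}
```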





[jira] [Resolved] (HUDI-728) Support for complex record keys with TimestampBasedKeyGenerator

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-728.

Resolution: Fixed

> Support for complex record keys with TimestampBasedKeyGenerator
> ---
>
> Key: HUDI-728
> URL: https://issues.apache.org/jira/browse/HUDI-728
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have TimestampBasedKeyGenerator for defining custom partition paths, and 
> we have ComplexKeyGenerator for using a combination of fields as the 
> record key or partition key. 
>  
> However, there is no support for the case where one wants a combination of 
> fields as the record key together with the ability to define custom 
> partition paths. This use case recently came up at my organisation. 
>  
> How about adding a CustomTimestampBasedKeyGenerator that supports this use 
> case? The class can simply extend TimestampBasedKeyGenerator and allow 
> users to use a combination of fields as the record key.
>  
> We will try to keep the implementation as generic as possible. 





[jira] [Resolved] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-684.

Resolution: Fixed

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log file formats: 
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should handle all of these 
> combinations. 





[jira] [Closed] (HUDI-728) Support for complex record keys with TimestampBasedKeyGenerator

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-728.
--

> Support for complex record keys with TimestampBasedKeyGenerator
> ---
>
> Key: HUDI-728
> URL: https://issues.apache.org/jira/browse/HUDI-728
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have TimestampBasedKeyGenerator for defining custom partition paths, and 
> we have ComplexKeyGenerator for using a combination of fields as the 
> record key or partition key. 
>  
> However, there is no support for the case where one wants a combination of 
> fields as the record key together with the ability to define custom 
> partition paths. This use case recently came up at my organisation. 
>  
> How about adding a CustomTimestampBasedKeyGenerator that supports this use 
> case? The class can simply extend TimestampBasedKeyGenerator and allow 
> users to use a combination of fields as the record key.
>  
> We will try to keep the implementation as generic as possible. 





[jira] [Reopened] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-684:


> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log file formats: 
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should handle all of these 
> combinations. 





[jira] [Closed] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-742.
--

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Assignee: edwinguo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSet
> {code}

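For context on the error itself: the descriptor `(JI)I` denotes the overload `Math.floorMod(long, int)`, which was only added in Java 9, so bytecode compiled against a newer JDK fails with NoSuchMethodError on a Java 8 runtime. A minimal stand-in (hypothetical helper, not Hudi's actual fix) using the `Math.floorMod(long, long)` overload, available since Java 8:

```java
// Math.floorMod(long, int) (descriptor (JI)I) only exists in Java 9+.
// On a Java 8 runtime, bytecode calling it throws NoSuchMethodError.
// This helper computes the same value via Math.floorMod(long, long),
// which has been available since Java 8. The result always fits in an
// int because it lies in [0, |y|).
public class FloorModCompat {
    static int floorMod(long x, int y) {
        return (int) Math.floorMod(x, (long) y);
    }

    public static void main(String[] args) {
        System.out.println(floorMod(-7L, 3)); // 2 (floored, not truncated, modulus)
        System.out.println(floorMod(7L, 3));  // 1
    }
}
```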
[jira] [Closed] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-738.
--

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With filterDupes enabled, incoming records that already exist in the table 
> are filtered out, so updates are silently dropped. 
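
A minimal sketch of the kind of guard being requested; the method name and error message here are hypothetical, not DeltaStreamer's actual code:

```java
// Illustrative config guard only; not DeltaStreamer's actual API.
public class DeltaStreamerConfigCheck {
    static void validate(boolean filterDupes, String operation) {
        if (filterDupes && "UPSERT".equals(operation)) {
            throw new IllegalArgumentException(
                "filterDupes=true drops incoming records that already exist in "
                + "the table, so updates would be silently ignored; "
                + "use operation=INSERT instead.");
        }
    }

    public static void main(String[] args) {
        validate(true, "INSERT");  // ok: dedupe-and-insert is a valid combination
        try {
            validate(true, "UPSERT");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```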





[jira] [Reopened] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-742:


> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Assignee: edwinguo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTas
> {code}

[jira] [Resolved] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-742.

Resolution: Fixed

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Assignee: edwinguo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spark.scheduler.DAGSchedu
> {code}

[jira] [Reopened] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-738:


> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With filterDupes enabled, incoming records that already exist in the table 
> are filtered out, so updates are silently dropped. 





[jira] [Closed] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-760.
--

> Remove Rolling Stat management from Hudi Writer
> ---
>
> Key: HUDI-760
> URL: https://issues.apache.org/jira/browse/HUDI-760
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: renyi.bao
>Priority: Blocker
>  Labels: bug-bash-0.6.0, help-requested, help-wanted, newbie, 
> pull-request-available
> Fix For: 0.6.0
>
>
> The current implementation of rolling stats is not scalable. Since 
> Consolidated Metadata will eventually be implemented, file-level stats can 
> be managed under that single design as well.





[jira] [Resolved] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-738.

Resolution: Fixed

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With filterDupes enabled, incoming records that already exist in the table 
> are filtered out, so updates are silently dropped. 





[jira] [Reopened] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-760:


> Remove Rolling Stat management from Hudi Writer
> ---
>
> Key: HUDI-760
> URL: https://issues.apache.org/jira/browse/HUDI-760
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: renyi.bao
>Priority: Blocker
>  Labels: bug-bash-0.6.0, help-requested, help-wanted, newbie, 
> pull-request-available
> Fix For: 0.6.0
>
>
> The current implementation of rolling stats is not scalable. Since 
> Consolidated Metadata will eventually be implemented, file-level stats can 
> be managed under that single design as well.





[jira] [Resolved] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-760.

Resolution: Fixed

> Remove Rolling Stat management from Hudi Writer
> ---
>
> Key: HUDI-760
> URL: https://issues.apache.org/jira/browse/HUDI-760
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: renyi.bao
>Priority: Blocker
>  Labels: bug-bash-0.6.0, help-requested, help-wanted, newbie, 
> pull-request-available
> Fix For: 0.6.0
>
>
> The current implementation of rolling stats is not scalable. Since 
> Consolidated Metadata will eventually be implemented, file-level stats can 
> be managed under that single design as well.





[jira] [Resolved] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-802.

Resolution: Fixed

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  
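
A simplified sketch of the fix idea described above: apply the same "Op" == "D" delete check in the insert path, not only in combineAndGetUpdateValue. The types here are stand-ins (a Map instead of an Avro GenericRecord), not the actual AWSDmsAvroPayload API.

```java
import java.util.Map;
import java.util.Optional;

// Simplified illustration of the proposed delete handling; not the
// real AWSDmsAvroPayload class.
public class DmsPayloadSketch {
    // An empty Optional signals "do not write this record" (a delete).
    // Checking the DMS "Op" column here covers the case where an insert
    // and a delete for the same row arrive in a single batch, so the
    // record never existed in the table and only the insert path runs.
    static Optional<Map<String, String>> getInsertValue(Map<String, String> record) {
        if ("D".equals(record.get("Op"))) {
            return Optional.empty();
        }
        return Optional.of(record);
    }

    public static void main(String[] args) {
        System.out.println(getInsertValue(Map.of("Op", "D", "id", "1"))); // Optional.empty
        System.out.println(getInsertValue(Map.of("Op", "I", "id", "2")).isPresent()); // true
    }
}
```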





[jira] [Closed] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-802.
--

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete of the same row (e.g. DMS 
> processes them together and puts both records in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not yet in the table, so getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic that checks for a delete lives in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix the issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-802:


> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete of the same row (e.g. DMS 
> processes them together and puts both records in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not yet in the table, so getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic that checks for a delete lives in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix the issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-855) Run Auto Cleaner in parallel with ingestion

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-855.
--

> Run Auto Cleaner in parallel with ingestion
> ---
>
> Key: HUDI-855
> URL: https://issues.apache.org/jira/browse/HUDI-855
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, auto clean runs synchronously after ingestion finishes. Since 
> cleaning and ingestion can safely happen in parallel, we can take advantage of 
> this and schedule them to run concurrently.
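The proposed overlap can be illustrated with a minimal, self-contained Java sketch; the task bodies are hypothetical stand-ins for the real ingestion and cleaner work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal concurrency sketch (hypothetical names) of the change described
// above: submit the cleaner in parallel with ingestion and wait for both,
// instead of running cleaning synchronously after ingestion finishes.
public class AsyncCleanerSketch {
    static String runBoth() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> ingest = pool.submit(() -> "ingest-done");
            // Safe to overlap: cleaning only removes old file versions that
            // neither active readers nor the ongoing write need.
            Future<String> clean = pool.submit(() -> "clean-done");
            return ingest.get() + "," + clean.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runBoth()); // ingest-done,clean-done
    }
}
```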



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-855) Run Auto Cleaner in parallel with ingestion

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-855.

Resolution: Fixed

> Run Auto Cleaner in parallel with ingestion
> ---
>
> Key: HUDI-855
> URL: https://issues.apache.org/jira/browse/HUDI-855
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, auto clean runs synchronously after ingestion finishes. Since 
> cleaning and ingestion can safely happen in parallel, we can take advantage of 
> this and schedule them to run concurrently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-855) Run Auto Cleaner in parallel with ingestion

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-855:


> Run Auto Cleaner in parallel with ingestion
> ---
>
> Key: HUDI-855
> URL: https://issues.apache.org/jira/browse/HUDI-855
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, auto clean runs synchronously after ingestion finishes. Since 
> cleaning and ingestion can safely happen in parallel, we can take advantage of 
> this and schedule them to run concurrently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-883) Update docs for new config in PR-1605

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-883.

Resolution: Fixed

> Update docs for new config in PR-1605
> -
>
> Key: HUDI-883
> URL: https://issues.apache.org/jira/browse/HUDI-883
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>
> PR : 
> [https://github.com/apache/incubator-hudi/pull/1605|https://github.com/apache/incubator-hudi/pull/1605/files]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-882) Update documentation with new configs for 0.6.0 release

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-882.
--

> Update documentation with new configs for 0.6.0 release
> ---
>
> Key: HUDI-882
> URL: https://issues.apache.org/jira/browse/HUDI-882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Umbrella ticket to track new configurations that need to be added to the docs 
> page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-883) Update docs for new config in PR-1605

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-883:


> Update docs for new config in PR-1605
> -
>
> Key: HUDI-883
> URL: https://issues.apache.org/jira/browse/HUDI-883
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>
> PR : 
> [https://github.com/apache/incubator-hudi/pull/1605|https://github.com/apache/incubator-hudi/pull/1605/files]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-883) Update docs for new config in PR-1605

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-883.
--

> Update docs for new config in PR-1605
> -
>
> Key: HUDI-883
> URL: https://issues.apache.org/jira/browse/HUDI-883
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>
> PR : 
> [https://github.com/apache/incubator-hudi/pull/1605|https://github.com/apache/incubator-hudi/pull/1605/files]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-883) Update docs for new config in PR-1605

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-883:
---
Status: Closed  (was: Patch Available)

> Update docs for new config in PR-1605
> -
>
> Key: HUDI-883
> URL: https://issues.apache.org/jira/browse/HUDI-883
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>
> PR : 
> [https://github.com/apache/incubator-hudi/pull/1605|https://github.com/apache/incubator-hudi/pull/1605/files]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-882) Update documentation with new configs for 0.6.0 release

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-882.

Resolution: Fixed

> Update documentation with new configs for 0.6.0 release
> ---
>
> Key: HUDI-882
> URL: https://issues.apache.org/jira/browse/HUDI-882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Umbrella ticket to track new configurations that need to be added to the docs 
> page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-942) Increase minimum number of delta commits for inline compaction

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-942.

Resolution: Fixed

> Increase minimum number of delta commits for inline compaction
> -
>
> Key: HUDI-942
> URL: https://issues.apache.org/jira/browse/HUDI-942
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 0.6.0
>Reporter: Sathyaprakash Govindasamy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The default number of delta commits required to trigger inline compaction 
> (hoodie.compact.inline.max.delta.commits) is currently 1. Because of this, by 
> default every delta commit to a MERGE ON READ table also performs inline 
> compaction automatically. 
> I think a default value of 1 is a little overkill, and it also makes MERGE ON 
> READ behave like COPY ON WRITE, with compaction on every run. We should increase 
> the default value to more than 1.
> https://github.com/apache/hudi/blob/f34de3fb2738c8c36c937eba8df2a6848fafa886/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100
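As a sketch, raising the trigger could look like the plain-Java properties snippet below. The config keys come from the issue text; the value 5 is purely illustrative, not the project's chosen default.

```java
import java.util.Properties;

// Sketch: raising the inline-compaction trigger above 1 so that a
// MERGE ON READ table is not compacted on every single delta commit.
// Assumption: these properties are fed to the Hudi writer config; the
// value "5" is only an example.
public class CompactionConfigSketch {
    static Properties compactionProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.compact.inline", "true");
        // Compact only after 5 delta commits instead of after every commit.
        props.setProperty("hoodie.compact.inline.max.delta.commits", "5");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(
            compactionProps().getProperty("hoodie.compact.inline.max.delta.commits")); // 5
    }
}
```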



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-900) Metadata Bootstrap Key Generator needs to handle complex keys correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-900.
--

> Metadata Bootstrap Key Generator needs to handle complex keys correctly
> ---
>
> Key: HUDI-900
> URL: https://issues.apache.org/jira/browse/HUDI-900
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>
> Look at ComplexKeyGenerator. Make sure MetadataBootstrap uses the same format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-971.
--

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it returns 
> unclean partition names because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-942) Increase minimum number of delta commits for inline compaction

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-942:


> Increase minimum number of delta commits for inline compaction
> -
>
> Key: HUDI-942
> URL: https://issues.apache.org/jira/browse/HUDI-942
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 0.6.0
>Reporter: Sathyaprakash Govindasamy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The default number of delta commits required to trigger inline compaction 
> (hoodie.compact.inline.max.delta.commits) is currently 1. Because of this, by 
> default every delta commit to a MERGE ON READ table also performs inline 
> compaction automatically. 
> I think a default value of 1 is a little overkill, and it also makes MERGE ON 
> READ behave like COPY ON WRITE, with compaction on every run. We should increase 
> the default value to more than 1.
> https://github.com/apache/hudi/blob/f34de3fb2738c8c36c937eba8df2a6848fafa886/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-900) Metadata Bootstrap Key Generator needs to handle complex keys correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-900.

Resolution: Fixed

> Metadata Bootstrap Key Generator needs to handle complex keys correctly
> ---
>
> Key: HUDI-900
> URL: https://issues.apache.org/jira/browse/HUDI-900
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>
> Look at ComplexKeyGenerator. Make sure MetadataBootstrap uses the same format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-942) Increase minimum number of delta commits for inline compaction

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-942.
--

> Increase minimum number of delta commits for inline compaction
> -
>
> Key: HUDI-942
> URL: https://issues.apache.org/jira/browse/HUDI-942
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 0.6.0
>Reporter: Sathyaprakash Govindasamy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The default number of delta commits required to trigger inline compaction 
> (hoodie.compact.inline.max.delta.commits) is currently 1. Because of this, by 
> default every delta commit to a MERGE ON READ table also performs inline 
> compaction automatically. 
> I think a default value of 1 is a little overkill, and it also makes MERGE ON 
> READ behave like COPY ON WRITE, with compaction on every run. We should increase 
> the default value to more than 1.
> https://github.com/apache/hudi/blob/f34de3fb2738c8c36c937eba8df2a6848fafa886/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-900) Metadata Bootstrap Key Generator needs to handle complex keys correctly

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-900:


> Metadata Bootstrap Key Generator needs to handle complex keys correctly
> ---
>
> Key: HUDI-900
> URL: https://issues.apache.org/jira/browse/HUDI-900
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>
> Look at ComplexKeyGenerator. Make sure MetadataBootstrap uses the same format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-971:


> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it returns 
> unclean partition names because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-971.

Resolution: Fixed

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it returns 
> unclean partition names because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1013.
---

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Our bulk insert (in fact, all operations, not just bulk insert) does a 
> Dataset-to-RDD conversion in HoodieSparkSqlWriter, and our HoodieClient deals 
> with JavaRDDs. We are trying to see if we can improve our 
> performance by avoiding the RDD conversion. We will first start off with bulk 
> insert and get it working end to end before we decide whether to do this for 
> other operations too, after some perf analysis. 
>  
> On a high level, this is the idea:
> 1. The Dataset will be passed all the way from the Spark SQL writer to the 
> storage writer. We do not convert to HoodieRecord at any point in time. 
> 2. We need to use 
> [ParquetWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala]
>  to write to Parquet as InternalRows.
> 3. The gist of what we want to do: with the Datasets, sort by 
> partition path and record keys, repartition by the parallelism config, and do 
> mapPartitions. Within mapPartitions, we iterate through the Rows, encode them 
> to InternalRows, and write to Parquet using the write support linked above. 
> We first wanted to check whether this strategy actually improves the perf, so 
> I did a quick hack of just the mapPartitions func in HoodieSparkSqlWriter to 
> see how the numbers look. Check for operation 
> "bulk_insert_direct_parquet_write_support" 
> [here|#diff-5317f4121df875e406876f9f0f012fac]. 
> These are the numbers I got. (1) is the existing hoodie bulk insert, which does 
> the RDD conversion to JavaRDD. (2) is writing directly to 
> parquet in Spark (code given below). (3) is the modified hoodie code (i.e. 
> operation bulk_insert_direct_parquet_write_support).
>  
> | |5M records, 100 parallelism, input size 2.5 GB|
> |(1) Orig hoodie (unmodified)|169 secs, output size 2.7 GB|
> |(2) Parquet|62 secs, output size 2.5 GB|
> |(3) Modified hudi code, direct Parquet write|73 secs, output size 2.5 GB|
>  
> So, essentially our existing bulk insert takes more than 2x as long as plain 
> Parquet. Our modified hudi code (i.e. operation 
> bulk_insert_direct_parquet_write_support) is close to the direct Parquet write 
> in Spark, which shows that our strategy should work. 
> // This is the Parquet write in Spark. (2) above. 
> transformedDF.sort("partition", "key")
>   .coalesce(parallelism)
>   .write.format("parquet")
>   .partitionBy("partition")
>   .mode(saveMode)
>   .save(s"$outputPath/$format")
>  
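The sort-then-write-per-partition idea described above can be sketched without Spark; the Row type and the list standing in for the Parquet writer are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the proposed write path (no Spark, hypothetical
// record type): sort rows by (partition, key), then stream each partition's
// rows through a single writer pass, mirroring sort + mapPartitions.
public class BulkInsertSketch {
    record Row(String partition, String key) {}

    static Map<String, List<String>> writeSorted(List<Row> rows) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        rows.stream()
            .sorted(Comparator.comparing(Row::partition).thenComparing(Row::key))
            .forEach(r -> files.computeIfAbsent(r.partition(), p -> new ArrayList<>())
                               .add(r.key())); // stands in for the Parquet write
        return files;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("2020/08/24", "b"),
                                 new Row("2020/08/23", "a"),
                                 new Row("2020/08/24", "a"));
        System.out.println(writeSorted(rows));
        // {2020/08/23=[a], 2020/08/24=[a, b]}
    }
}
```

Because the rows arrive at the writer already sorted, each partition's file is opened and closed exactly once, which is what makes the direct-write path cheap.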



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1013:
-

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Our bulk insert (in fact, all operations, not just bulk insert) does a 
> Dataset-to-RDD conversion in HoodieSparkSqlWriter, and our HoodieClient deals 
> with JavaRDDs. We are trying to see if we can improve our 
> performance by avoiding the RDD conversion. We will first start off with bulk 
> insert and get it working end to end before we decide whether to do this for 
> other operations too, after some perf analysis. 
>  
> On a high level, this is the idea:
> 1. The Dataset will be passed all the way from the Spark SQL writer to the 
> storage writer. We do not convert to HoodieRecord at any point in time. 
> 2. We need to use 
> [ParquetWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala]
>  to write to Parquet as InternalRows.
> 3. The gist of what we want to do: with the Datasets, sort by 
> partition path and record keys, repartition by the parallelism config, and do 
> mapPartitions. Within mapPartitions, we iterate through the Rows, encode them 
> to InternalRows, and write to Parquet using the write support linked above. 
> We first wanted to check whether this strategy actually improves the perf, so 
> I did a quick hack of just the mapPartitions func in HoodieSparkSqlWriter to 
> see how the numbers look. Check for operation 
> "bulk_insert_direct_parquet_write_support" 
> [here|#diff-5317f4121df875e406876f9f0f012fac]. 
> These are the numbers I got. (1) is the existing hoodie bulk insert, which does 
> the RDD conversion to JavaRDD. (2) is writing directly to 
> parquet in Spark (code given below). (3) is the modified hoodie code (i.e. 
> operation bulk_insert_direct_parquet_write_support).
>  
> | |5M records, 100 parallelism, input size 2.5 GB|
> |(1) Orig hoodie (unmodified)|169 secs, output size 2.7 GB|
> |(2) Parquet|62 secs, output size 2.5 GB|
> |(3) Modified hudi code, direct Parquet write|73 secs, output size 2.5 GB|
>  
> So, essentially our existing bulk insert takes more than 2x as long as plain 
> Parquet. Our modified hudi code (i.e. operation 
> bulk_insert_direct_parquet_write_support) is close to the direct Parquet write 
> in Spark, which shows that our strategy should work. 
> // This is the Parquet write in Spark. (2) above. 
> transformedDF.sort("partition", "key")
>   .coalesce(parallelism)
>   .write.format("parquet")
>   .partitionBy("partition")
>   .mode(saveMode)
>   .save(s"$outputPath/$format")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1013.
-
Resolution: Fixed

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Our bulk insert (in fact, all operations, not just bulk insert) does a 
> Dataset-to-RDD conversion in HoodieSparkSqlWriter, and our HoodieClient deals 
> with JavaRDDs. We are trying to see if we can improve our 
> performance by avoiding the RDD conversion. We will first start off with bulk 
> insert and get it working end to end before we decide whether to do this for 
> other operations too, after some perf analysis. 
>  
> On a high level, this is the idea:
> 1. The Dataset will be passed all the way from the Spark SQL writer to the 
> storage writer. We do not convert to HoodieRecord at any point in time. 
> 2. We need to use 
> [ParquetWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala]
>  to write to Parquet as InternalRows.
> 3. The gist of what we want to do: with the Datasets, sort by 
> partition path and record keys, repartition by the parallelism config, and do 
> mapPartitions. Within mapPartitions, we iterate through the Rows, encode them 
> to InternalRows, and write to Parquet using the write support linked above. 
> We first wanted to check whether this strategy actually improves the perf, so 
> I did a quick hack of just the mapPartitions func in HoodieSparkSqlWriter to 
> see how the numbers look. Check for operation 
> "bulk_insert_direct_parquet_write_support" 
> [here|#diff-5317f4121df875e406876f9f0f012fac]. 
> These are the numbers I got. (1) is the existing hoodie bulk insert, which does 
> the RDD conversion to JavaRDD. (2) is writing directly to 
> parquet in Spark (code given below). (3) is the modified hoodie code (i.e. 
> operation bulk_insert_direct_parquet_write_support).
>  
> | |5M records, 100 parallelism, input size 2.5 GB|
> |(1) Orig hoodie (unmodified)|169 secs, output size 2.7 GB|
> |(2) Parquet|62 secs, output size 2.5 GB|
> |(3) Modified hudi code, direct Parquet write|73 secs, output size 2.5 GB|
>  
> So, essentially our existing bulk insert takes more than 2x as long as plain 
> Parquet. Our modified hudi code (i.e. operation 
> bulk_insert_direct_parquet_write_support) is close to the direct Parquet write 
> in Spark, which shows that our strategy should work. 
> // This is the Parquet write in Spark. (2) above. 
> transformedDF.sort("partition", "key")
>   .coalesce(parallelism)
>   .write.format("parquet")
>   .partitionBy("partition")
>   .mode(saveMode)
>   .save(s"$outputPath/$format")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1100.
---

> Docs: Update index type in Configuration section and fix typo in deployment 
> page
> 
>
> Key: HUDI-1100
> URL: https://issues.apache.org/jira/browse/HUDI-1100
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> 1. Add GLOBAL_BLOOM as one of the index types in the Configuration section: 
>  [https://github.com/apache/hudi/issues/1771]
> 2. Fix the typo in the deployment section: 
> [https://github.com/apache/hudi/issues/1788]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1100:
-

> Docs: Update index type in Configuration section and fix typo in deployment 
> page
> 
>
> Key: HUDI-1100
> URL: https://issues.apache.org/jira/browse/HUDI-1100
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> 1. Add GLOBAL_BLOOM as one of the index types in the Configuration section: 
>  [https://github.com/apache/hudi/issues/1771]
> 2. Fix the typo in the deployment section: 
> [https://github.com/apache/hudi/issues/1788]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1100:

Status: Closed  (was: Patch Available)

> Docs: Update index type in Configuration section and fix typo in deployment 
> page
> 
>
> Key: HUDI-1100
> URL: https://issues.apache.org/jira/browse/HUDI-1100
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> 1. Add GLOBAL_BLOOM as one of the index types in the Configuration section: 
>  [https://github.com/apache/hudi/issues/1771]
> 2. Fix the typo in the deployment section: 
> [https://github.com/apache/hudi/issues/1788]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1190.
---

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1100.
-
Resolution: Fixed

> Docs: Update index type in Configuration section and fix typo in deployment 
> page
> 
>
> Key: HUDI-1100
> URL: https://issues.apache.org/jira/browse/HUDI-1100
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> 1. Add GLOBAL_BLOOM as one of the index types in the Configuration section: 
>  [https://github.com/apache/hudi/issues/1771]
> 2. Fix typo in the deployment section: 
> [https://github.com/apache/hudi/issues/1788]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1098.
-
Resolution: Fixed

> Marker file finalizing may block on a data file that was never written
> --
>
> Key: HUDI-1098
> URL: https://issues.apache.org/jira/browse/HUDI-1098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> {code:java}
> // Ensure all files in the delete list are actually present. This is mandatory
> // for an eventually consistent FS. Otherwise, we may miss deleting such files.
> // If files are not found even after retries, fail the commit.
> if (consistencyCheckEnabled) {
>   // This will ensure all files to be deleted are present.
>   waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
> }
> {code}
> We need to handle the case where the marker file was created, but we crashed 
> before the data file was created. 
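To illustrate the failure mode, here is a minimal, hypothetical sketch (not Hudi's actual code): it partitions marker paths by whether the corresponding data file exists on a local filesystem. The never-written case is exactly the one a visibility wait would block on forever. The `.marker` naming convention and the method name are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkerReconcile {

    // Split markers into: data file present (safe to wait for visibility)
    // vs. data file never written (waiting on it would block forever).
    public static Map<Boolean, List<Path>> partitionByDataFileExists(
            List<Path> markerPaths, Path dataDir) {
        Map<Boolean, List<Path>> out = new HashMap<>();
        out.put(true, new ArrayList<>());
        out.put(false, new ArrayList<>());
        for (Path marker : markerPaths) {
            // assumed convention: "f1.parquet.marker" tracks data file "f1.parquet"
            String dataName = marker.getFileName().toString().replace(".marker", "");
            out.get(Files.exists(dataDir.resolve(dataName))).add(marker);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("marker-demo");
        Files.createFile(dir.resolve("f1.parquet")); // data file was written
        // f2.parquet crashed before being written; only its marker exists
        List<Path> markers = Arrays.asList(
                dir.resolve("f1.parquet.marker"), dir.resolve("f2.parquet.marker"));
        Map<Boolean, List<Path>> parts = partitionByDataFileExists(markers, dir);
        System.out.println("wait for: " + parts.get(true).size()
                + ", clean up: " + parts.get(false).size()); // → wait for: 1, clean up: 1
    }
}
```

In a real fix the second bucket would be reconciled (marker cleaned, file excluded from the wait list) instead of being passed to the visibility check.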



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1190:
-

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1098:
-

> Marker file finalizing may block on a data file that was never written
> --
>
> Key: HUDI-1098
> URL: https://issues.apache.org/jira/browse/HUDI-1098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> {code:java}
> // Ensure all files in the delete list are actually present. This is mandatory
> // for an eventually consistent FS. Otherwise, we may miss deleting such files.
> // If files are not found even after retries, fail the commit.
> if (consistencyCheckEnabled) {
>   // This will ensure all files to be deleted are present.
>   waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
> }
> {code}
> We need to handle the case where the marker file was created, but we crashed 
> before the data file was created. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1068:
-

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1068.
-
Resolution: Fixed

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1190.
-
Resolution: Fixed

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1098) Marker file finalizing may block on a data file that was never written

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1098.
---

> Marker file finalizing may block on a data file that was never written
> --
>
> Key: HUDI-1098
> URL: https://issues.apache.org/jira/browse/HUDI-1098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> {code:java}
> // Ensure all files in the delete list are actually present. This is mandatory
> // for an eventually consistent FS. Otherwise, we may miss deleting such files.
> // If files are not found even after retries, fail the commit.
> if (consistencyCheckEnabled) {
>   // This will ensure all files to be deleted are present.
>   waitForAllFiles(jsc, groupByPartition, FileVisibility.APPEAR);
> }
> {code}
> We need to handle the case where the marker file was created, but we crashed 
> before the data file was created. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1068.
---

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1054:
-

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function that are manifesting and becoming more prominent with the new 
> bootstrap mechanism on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>  is a serial operation performed at the driver, and it can take a long time 
> when you have several partitions and a large number of files.
>  * The invalid data paths are being stored in a List instead of a Set, and as a 
> result the following operation becomes O(N^2), taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have a large number of partitions and files.
>  
> Upon testing with a 1 TB data set having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with Spark parallelization and 
> appropriate data structures.
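The List-vs-Set point above can be sketched in isolation. This is a hypothetical illustration (the class and method names are invented, not Hudi's API): building a `HashSet` once turns each membership check into an O(1) amortized lookup, so filtering N paths costs O(N) instead of O(N^2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InvalidPathFilter {

    // Filter out invalid paths. With a plain List, each contains() is O(N),
    // so the whole pass is O(N^2); a HashSet makes each lookup O(1) amortized.
    public static List<String> filterValid(Collection<String> allPaths,
                                           Collection<String> invalidPaths) {
        Set<String> invalid = new HashSet<>(invalidPaths); // one-time O(N) build
        List<String> valid = new ArrayList<>();
        for (String p : allPaths) {
            if (!invalid.contains(p)) { // O(1) amortized, vs O(N) on a List
                valid.add(p);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("p1/f1.parquet", "p1/f2.parquet", "p2/f3.parquet");
        List<String> invalid = Arrays.asList("p1/f2.parquet");
        System.out.println(filterValid(all, invalid)); // → [p1/f1.parquet, p2/f3.parquet]
    }
}
```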



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1014) Design and Implement upgrade-downgrade infrastrucutre

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1014.
-
Resolution: Fixed

> Design and Implement upgrade-downgrade infrastrucutre
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1014) Design and Implement upgrade-downgrade infrastrucutre

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1014:
-

> Design and Implement upgrade-downgrade infrastrucutre
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1054.
-
Resolution: Fixed

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function that are manifesting and becoming more prominent with the new 
> bootstrap mechanism on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>  is a serial operation performed at the driver, and it can take a long time 
> when you have several partitions and a large number of files.
>  * The invalid data paths are being stored in a List instead of a Set, and as a 
> result the following operation becomes O(N^2), taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have a large number of partitions and files.
>  
> Upon testing with a 1 TB data set having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with Spark parallelization and 
> appropriate data structures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1014) Design and Implement upgrade-downgrade infrastrucutre

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1014.
---

> Design and Implement upgrade-downgrade infrastrucutre
> -
>
> Key: HUDI-1014
> URL: https://issues.apache.org/jira/browse/HUDI-1014
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1054) Address performance issues with finalizing writes on S3

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1054.
---

> Address performance issues with finalizing writes on S3
> ---
>
> Key: HUDI-1054
> URL: https://issues.apache.org/jira/browse/HUDI-1054
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap, Common Core, Performance
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the 
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
>  function that are manifesting and becoming more prominent with the new 
> bootstrap mechanism on S3:
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
>  is a serial operation performed at the driver, and it can take a long time 
> when you have several partitions and a large number of files.
>  * The invalid data paths are being stored in a List instead of a Set, and as a 
> result the following operation becomes O(N^2), taking significant time to compute 
> at the driver: 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * 
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
>  does a recursive delete of the marker directory at the driver. This is again 
> extremely expensive when you have a large number of partitions and files.
>  
> Upon testing with a 1 TB data set having 8000 partitions and approximately 
> 19 files, this whole process consumes *35 minutes*. There is scope to 
> address these performance issues with Spark parallelization and 
> appropriate data structures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha opened a new pull request #2023: [HUDI-1137] Add option to configure different path selector

2020-08-24 Thread GitBox


satishkotha opened a new pull request #2023:
URL: https://github.com/apache/hudi/pull/2023


   ## What is the purpose of the pull request
   
   Add configuration option to pick different path selector
   
   ## Brief change log
   
   Add configuration option to pick different path selector. This is useful for 
replace and insert overwrite (RFC-18/RFC-19 still in progress)
   
   
   ## Verify this pull request
   
   This is a test-suite-only change. It can be verified with test suite DAGs
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1137) [Test Suite] Add option to configure different path selector

2020-08-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1137:
-
Labels: pull-request-available  (was: )

> [Test Suite] Add option to configure different path selector
> 
>
> Key: HUDI-1137
> URL: https://issues.apache.org/jira/browse/HUDI-1137
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: [DOC] Doc changes for release 0.6.0 (#2011)

2020-08-24 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 0ac5f3f  [DOC] Doc changes for release 0.6.0  (#2011)
0ac5f3f is described below

commit 0ac5f3f4e20cee484412ed89e6631b2171196f0c
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Mon Aug 24 11:05:13 2020 -0700

[DOC] Doc changes for release 0.6.0  (#2011)

* [DOC] Change instructions and queries supported by PrestoDB

* Adding video and blog from 'PrestoDB and Apache Hudi' talk on Presto 
Meetup

* Config page changes

- Add doc for using jdbc during hive sync
- Fix index types to include all available indexes
- Fix default val for hoodie.copyonwrite.insert.auto.split
- Add doc for user defined bulk insert partitioner class
- Add simple index configs
- Reorder all index configs to be grouped together
- Add docs for auto cleaning and async cleaning
- Add docs for rollback parallelism and marker based rollback
- Add doc for bulk-insert sort modes
- Add doc for markers delete parallelism

* CR feedback

Co-authored-by: Vinoth Chandar 
---
 docs/_docs/1_2_structure.md|  2 +-
 docs/_docs/1_4_powered_by.md   |  3 ++
 docs/_docs/1_5_comparison.md   |  4 +-
 docs/_docs/2_3_querying_data.cn.md |  8 ++--
 docs/_docs/2_3_querying_data.md| 21 ++---
 docs/_docs/2_4_configurations.md   | 88 ++
 docs/_docs/2_6_deployment.md   |  6 +--
 7 files changed, 107 insertions(+), 25 deletions(-)

diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index ddcdb1a..1c59960 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -16,6 +16,6 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large 
analytical tab
 
 
 
-By carefully managing how data is laid out in storage & how it’s exposed to 
queries, Hudi is able to power a rich data ecosystem where external sources can 
be ingested in near real-time and made available for interactive SQL Engines 
like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), 
while at the same time capable of being consumed incrementally from 
processing/ETL frameworks like [Hive](https://hive.apache.org/) & 
[Spark](https://spark.apache.org/docs/latest/) t [...]
+By carefully managing how data is laid out in storage & how it’s exposed to 
queries, Hudi is able to power a rich data ecosystem where external sources can 
be ingested in near real-time and made available for interactive SQL Engines 
like [PrestoDB](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), 
while at the same time capable of being consumed incrementally from 
processing/ETL frameworks like [Hive](https://hive.apache.org/) & 
[Spark](https://spark.apache.org/docs/latest/) [...]
 
 Hudi broadly consists of a self contained Spark library to build tables and 
integrations with existing query engines for data access. See 
[quickstart](/docs/quick-start-guide) for a demo.
diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index a731979..8e093a4 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -113,6 +113,8 @@ Using Hudi at Yotpo for several usages. Firstly, integrated 
Hudi as a writer in
 
 14. ["Apache Hudi - Design/Code Walkthrough Session for 
Contributors"](https://www.youtube.com/watch?v=N2eDfU_rQ_U) - By Vinoth 
Chandar, July 2020, Hudi community.
 
+15. ["PrestoDB and Apache Hudi"](https://youtu.be/nA3rwOdmm3A) - By Bhavani 
Sudha Saktheeswaran and Brandon Scheller, Aug 2020, PrestoDB Community Meetup.
+
 ## Articles
 
 1. ["The Case for incremental processing on 
Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
 - O'reilly Ideas article by Vinoth Chandar
@@ -122,6 +124,7 @@ Using Hudi at Yotpo for several usages. Firstly, integrated 
Hudi as a writer in
 5. ["Apache Hudi grows cloud data lake 
maturity"](https://searchdatamanagement.techtarget.com/news/252484740/Apache-Hudi-grows-cloud-data-lake-maturity)
 6. ["Building a Large-scale Transactional Data Lake at Uber Using Apache 
Hudi"](https://eng.uber.com/apache-hudi-graduation/) - Uber eng blog by Nishith 
Agarwal
 7. ["Hudi On 
Hops"](https://www.diva-portal.org/smash/get/diva2:1413103/FULLTEXT01.pdf) - By 
NETSANET GEBRETSADKAN KIDANE
+8. ["PrestoDB and Apache 
Hudi"](https://prestodb.io/blog/2020/08/04/prestodb-and-hudi) - PrestoDB - Hudi 
integration blog by Bhavani Sudha Saktheeswaran and Brandon Scheller 
 
 ## Powered by
 
diff --git a/docs/_docs/1_5_comparison.md b/docs/_docs/1_5_comparison.md
index 32b73c6..41131a8 100644
--- a/docs/_docs/1_5_comparison.md
+++ b/docs/_docs/1_5_comparison.md
@@ -31,7 +31,7 @@ we expect Hudi to positioned at something that ingests 
parquet with superior per
 Hi

[GitHub] [hudi] bhasudha merged pull request #2011: [DOC] Doc changes for release 0.6.0

2020-08-24 Thread GitBox


bhasudha merged pull request #2011:
URL: https://github.com/apache/hudi/pull/2011


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #1947: datadog monitor hudi

2020-08-24 Thread GitBox


bvaradar closed issue #1947:
URL: https://github.com/apache/hudi/issues/1947


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1947: datadog monitor hudi

2020-08-24 Thread GitBox


bvaradar commented on issue #1947:
URL: https://github.com/apache/hudi/issues/1947#issuecomment-679276189


   @cun123 : Please reopen if @xushiyan's suggestion did not work.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query

2020-08-24 Thread GitBox


bvaradar commented on issue #2002:
URL: https://github.com/apache/hudi/issues/2002#issuecomment-679275414


   @jpugliesi : It looks like the 2nd and 3rd upserts updated the same set of 
records (generateUpdates()). In this case, all those records will be updated 
with the latest commit time, and the incremental query will only show the commit 
time of the 3rd upsert. Hope this is clear. Please reopen if this does not make 
sense to you.
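The behavior described can be sketched with a toy model — purely illustrative, not Hudi's implementation: each record keeps only the commit time that last wrote it, so an incremental read after an intermediate commit surfaces records tagged with the final commit only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IncrementalView {

    // Toy model: each record key maps to the commit time that last wrote it.
    static final Map<String, String> commitTimeByKey = new TreeMap<>();

    public static void upsert(List<String> keys, String commitTime) {
        for (String k : keys) {
            commitTimeByKey.put(k, commitTime); // last writer wins
        }
    }

    // Incremental query: records whose latest commit time is after sinceCommit.
    public static List<String> incrementalQuery(String sinceCommit) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : commitTimeByKey.entrySet()) {
            if (e.getValue().compareTo(sinceCommit) > 0) {
                out.add(e.getKey() + "@" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        upsert(Arrays.asList("k1", "k2"), "c1"); // initial insert
        upsert(Arrays.asList("k1", "k2"), "c2"); // 2nd upsert, same keys
        upsert(Arrays.asList("k1", "k2"), "c3"); // 3rd upsert, same keys
        // reading changes since c2 returns the records, but tagged with c3 only
        System.out.println(incrementalQuery("c2")); // → [k1@c3, k2@c3]
    }
}
```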



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query

2020-08-24 Thread GitBox


bvaradar closed issue #2002:
URL: https://github.com/apache/hudi/issues/2002


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Sreeram Ramji (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreeram Ramji reassigned HUDI-945:
--

Assignee: Sreeram Ramji

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases, which run several 
> iterations, it is better to eagerly delete files when closing the handles using 
> them. 
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView
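A minimal sketch of the eager-cleanup idea, assuming a handle backed by a temp spill file (the class name and layout are invented for illustration): `deleteOnExit()` only fires at JVM shutdown, so a long-running streaming job accumulates spill files across iterations, whereas deleting in `close()` reclaims space as each handle is released.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpillableStore implements AutoCloseable {

    private final Path spillFile;

    public SpillableStore() throws IOException {
        // deleteOnExit() would defer cleanup to JVM shutdown; in a
        // long-running streaming job that leaks one file per iteration.
        this.spillFile = Files.createTempFile("spillable-map", ".data");
    }

    public Path path() {
        return spillFile;
    }

    @Override
    public void close() throws IOException {
        // eager cleanup: remove the spill file as soon as the handle closes
        Files.deleteIfExists(spillFile);
    }

    public static void main(String[] args) throws IOException {
        SpillableStore store = new SpillableStore();
        Path p = store.path();
        System.out.println("exists while open: " + Files.exists(p)); // → true
        store.close();
        System.out.println("exists after close: " + Files.exists(p)); // → false
    }
}
```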



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Sreeram Ramji (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183504#comment-17183504
 ] 

Sreeram Ramji commented on HUDI-945:


[~vbalaji] I am interested in working on this. Assigning it to myself

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases, which run several 
> iterations, it is better to eagerly delete files when closing the handles using 
> them. 
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183499#comment-17183499
 ] 

Balaji Varadarajan commented on HUDI-945:
-

[~baobaoyeye]: Let me know if you are interested in submitting a PR. 

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases, which run several 
> iterations, it is better to eagerly delete files when closing the handles using 
> them. 
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1190.
-
Resolution: Fixed

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-1190:
-

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>






[jira] [Closed] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-24 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha closed HUDI-1190.
---

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>






[GitHub] [hudi] bvaradar commented on issue #2001: NPE While writing data to same partition on S3

2020-08-24 Thread GitBox


bvaradar commented on issue #2001:
URL: https://github.com/apache/hudi/issues/2001#issuecomment-679264433


   Can you turn on INFO-level logs and attach them so we can debug this?
   
   Thanks,
   Balaji.V



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2007: [SUPPORT] Is Timeline metadata queryable ?

2020-08-24 Thread GitBox


ashishmgofficial commented on issue #2007:
URL: https://github.com/apache/hudi/issues/2007#issuecomment-679264539


   Our exact process is as follows:
   
   -  We have a Kafka topic producing data
   -  The target is a Ceph object store
   
   So, we need timeline data like:
   
   -  Source topic
   -  Target Hudi table on Ceph
   -  Other details such as timestamps, sizes, etc.
   
   Another query: does Hudi currently support exporting this data 
from the Hive CLI to some other database or as files?
   
   Also, other than the CLI, are any APIs exposed for retrieving this 
data programmatically?








[GitHub] [hudi] bvaradar commented on issue #2019: Leak in DiskBasedMap

2020-08-24 Thread GitBox


bvaradar commented on issue #2019:
URL: https://github.com/apache/hudi/issues/2019#issuecomment-679261953


   This is currently tracked in https://issues.apache.org/jira/browse/HUDI-945
   
   Should be a simple fix, as mentioned in the jira. Are you interested in 
submitting a PR? We will have this fixed in the next release.







[jira] [Comment Edited] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183488#comment-17183488
 ] 

Balaji Varadarajan edited comment on HUDI-945 at 8/24/20, 5:25 PM:
---

[https://github.com/apache/hudi/issues/2019]


was (Author: vbalaji):
https://issues.apache.org/jira/browse/HUDI-945

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are only deleted on 
> exit. For spark-streaming/deltastreamer continuous-mode cases, which run 
> several iterations, it is better to eagerly delete these files when closing 
> the handles that use them.
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView





[jira] [Commented] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183488#comment-17183488
 ] 

Balaji Varadarajan commented on HUDI-945:
-

https://issues.apache.org/jira/browse/HUDI-945

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are only deleted on 
> exit. For spark-streaming/deltastreamer continuous-mode cases, which run 
> several iterations, it is better to eagerly delete these files when closing 
> the handles that use them.
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView





[jira] [Commented] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183483#comment-17183483
 ] 

Balaji Varadarajan commented on HUDI-945:
-

[~baobaoyeye]: Sorry for the delay. I missed this message. The idea is correct, 
but we need to implement such cleanup in the "clear()" method, since it is part 
of the Map interface, and call it explicitly.
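
The idea can be sketched as follows. This is a minimal illustration only, not 
Hudi's actual DiskBasedMap implementation: the class and field names are 
hypothetical, and it assumes clear() both empties the in-memory map and removes 
the backing spill file, while close() calls clear() eagerly instead of relying 
on File.deleteOnExit():

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;

// Hypothetical sketch of the eager-cleanup idea discussed in this thread.
class SpillableMapSketch implements AutoCloseable {
    private final HashMap<String, String> inMemory = new HashMap<>();
    private final File spillFile;

    SpillableMapSketch() throws IOException {
        // Stand-in for the file the map would spill to on overflow.
        this.spillFile = File.createTempFile("spillable", ".data");
    }

    void put(String k, String v) {
        inMemory.put(k, v);
    }

    // clear() is part of the Map contract, so callers can trigger file
    // cleanup without knowing anything about the backing spill file.
    void clear() {
        inMemory.clear();
        if (spillFile.exists()) {
            spillFile.delete(); // ignore result in this sketch
        }
    }

    boolean spillFileExists() {
        return spillFile.exists();
    }

    @Override
    public void close() {
        clear(); // eager cleanup on close, not on JVM exit
    }
}
```

A handle such as MergeHandle would then simply call close() (or clear()) when 
it is done, so long-running continuous-mode jobs do not accumulate spill files.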

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are only deleted on 
> exit. For spark-streaming/deltastreamer continuous-mode cases, which run 
> several iterations, it is better to eagerly delete these files when closing 
> the handles that use them.
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView





[jira] [Updated] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-24 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-945:

Status: Open  (was: New)

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are only deleted on 
> exit. For spark-streaming/deltastreamer continuous-mode cases, which run 
> several iterations, it is better to eagerly delete these files when closing 
> the handles that use them.
> We need to eagerly delete the files in the following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView





[jira] [Created] (HUDI-1221) Ensure docker demo page reflects the latest support on all query engines

2020-08-24 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1221:
---

 Summary: Ensure docker demo page reflects the latest support on 
all query engines
 Key: HUDI-1221
 URL: https://issues.apache.org/jira/browse/HUDI-1221
 Project: Apache Hudi
  Issue Type: Task
  Components: Docs, docs-chinese
Reporter: Bhavani Sudha
Assignee: vinoyang
 Fix For: 0.6.1


This page does not get edited often. With the recent support for more table 
querying via the Spark datasource and Presto, we need to ensure all recent 
changes are reflected here as well.





[GitHub] [hudi] bvaradar commented on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-24 Thread GitBox


bvaradar commented on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-679253153


   Can you please add the details of 
   "commit showfiles --commit 20200821153748"?
   
   Are you running with consistency check enabled?
   
   Can you also check whether the file is actually absent by listing the folder 
s3://myBucket/absolute_path_to/daas_date=2020-05/
   
   Also, paste the output of the listing in this issue.







[jira] [Created] (HUDI-1220) Leverage java docs for publishing to config pages in site

2020-08-24 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1220:
---

 Summary: Leverage java docs for publishing to config pages in site
 Key: HUDI-1220
 URL: https://issues.apache.org/jira/browse/HUDI-1220
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Bhavani Sudha
 Fix For: 0.6.1








[GitHub] [hudi] bvaradar closed issue #2017: multi-level partition

2020-08-24 Thread GitBox


bvaradar closed issue #2017:
URL: https://github.com/apache/hudi/issues/2017


   







[GitHub] [hudi] bvaradar commented on issue #2017: multi-level partition

2020-08-24 Thread GitBox


bvaradar commented on issue #2017:
URL: https://github.com/apache/hudi/issues/2017#issuecomment-679249338


   Please reopen if this doesn't solve the problem.







[GitHub] [hudi] bvaradar commented on issue #2017: multi-level partition

2020-08-24 Thread GitBox


bvaradar commented on issue #2017:
URL: https://github.com/apache/hudi/issues/2017#issuecomment-679249107


   'hoodie.datasource.write.partitionpath.field' should be 'year,month'. 
Please make sure you use ComplexKeyGenerator.
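   
   For reference, a minimal sketch of the write options being suggested. The 
option keys and the ComplexKeyGenerator class name are standard Hudi configs; 
the record key field "id" is a placeholder for whatever field the table 
actually uses:

```java
import java.util.HashMap;
import java.util.Map;

public class MultiLevelPartitionOptions {
    // Builds Hudi write options for a two-level partition path
    // (year/month). ComplexKeyGenerator is needed whenever the
    // partition path is composed of more than one field.
    public static Map<String, String> build() {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.datasource.write.recordkey.field", "id"); // placeholder
        opts.put("hoodie.datasource.write.partitionpath.field", "year,month");
        opts.put("hoodie.datasource.write.keygenerator.class",
                 "org.apache.hudi.keygen.ComplexKeyGenerator");
        return opts;
    }
}
```

These options would be passed to the Spark datasource writer alongside the 
usual table name and operation settings.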
   







[GitHub] [hudi] bvaradar commented on issue #2007: [SUPPORT] Is Timeline metadata queryable ?

2020-08-24 Thread GitBox


bvaradar commented on issue #2007:
URL: https://github.com/apache/hudi/issues/2007#issuecomment-679248311


   Timeline metadata is available through the Hudi CLI currently. That said, I 
believe there is already someone looking at supporting this feature. cc @n3nash 
@satishkotha @prashantwason 







[GitHub] [hudi] bvaradar commented on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-24 Thread GitBox


bvaradar commented on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-679236183


   Closing this issue as we have a Jira to track it.







[GitHub] [hudi] bvaradar closed issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-24 Thread GitBox


bvaradar closed issue #1960:
URL: https://github.com/apache/hudi/issues/1960


   







[GitHub] [hudi] bvaradar commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

2020-08-24 Thread GitBox


bvaradar commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679235853


   @nsivabalan : Please take a look when you get a chance.







[GitHub] [hudi] bvaradar closed issue #1948: [SUPPORT] DMS example complains about dfs-source.properties

2020-08-24 Thread GitBox


bvaradar closed issue #1948:
URL: https://github.com/apache/hudi/issues/1948


   







[hudi] branch master updated: [MINOR] Update README.md (#2010)

2020-08-24 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 111a975  [MINOR] Update README.md (#2010)
111a975 is described below

commit 111a9753a0dc8a8634a5aaf549ceff5cbd89bcb8
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Mon Aug 24 09:28:29 2020 -0700

[MINOR] Update README.md (#2010)

- add maven profile to test running commands
- remove -DskipITs for packaging commands
---
 README.md | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 542ff2e..2416b3f 100644
--- a/README.md
+++ b/README.md
@@ -54,7 +54,7 @@ Prerequisites for building Apache Hudi:
 ```
 # Checkout code and build
 git clone https://github.com/apache/hudi.git && cd hudi
-mvn clean package -DskipTests -DskipITs
+mvn clean package -DskipTests
 
 # Start command
 spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
@@ -73,7 +73,7 @@ mvn clean javadoc:aggregate -Pjavadocs
 The default Scala version supported is 2.11. To build for Scala 2.12 version, 
build using `scala-2.12` profile
 
 ```
-mvn clean package -DskipTests -DskipITs -Dscala-2.12
+mvn clean package -DskipTests -Dscala-2.12
 ```
 
 ### Build without spark-avro module
@@ -83,7 +83,7 @@ The default hudi-jar bundles spark-avro module. To build 
without spark-avro modu
 ```
 # Checkout code and build
 git clone https://github.com/apache/hudi.git && cd hudi
-mvn clean package -DskipTests -DskipITs -Pspark-shade-unbundle-avro
+mvn clean package -DskipTests -Pspark-shade-unbundle-avro
 
 # Start command
 spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
@@ -94,14 +94,19 @@ spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
 
 ## Running Tests
 
-All tests can be run with maven
+Unit tests can be run with maven profile `unit-tests`.
 ```
-mvn test
+mvn -Punit-tests test
+```
+
+Functional tests, which are tagged with `@Tag("functional")`, can be run with 
maven profile `functional-tests`.
+```
+mvn -Pfunctional-tests test
 ```
 
 To run tests with spark event logging enabled, define the Spark event log 
directory. This allows visualizing test DAG and stages using Spark History 
Server UI.
 ```
-mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
+mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
 ```
 
 ## Quickstart



[GitHub] [hudi] xushiyan merged pull request #2010: [MINOR] Update README.md

2020-08-24 Thread GitBox


xushiyan merged pull request #2010:
URL: https://github.com/apache/hudi/pull/2010


   







[GitHub] [hudi] vinothchandar commented on pull request #2010: [MINOR] Update README.md

2020-08-24 Thread GitBox


vinothchandar commented on pull request #2010:
URL: https://github.com/apache/hudi/pull/2010#issuecomment-679231806


   @xushiyan can you try merging yourself using squash and merge? 







svn commit: r41095 - in /release/hudi: 0.6.0/ hudi-0.6.0/

2020-08-24 Thread bhavanisudha
Author: bhavanisudha
Date: Mon Aug 24 16:08:47 2020
New Revision: 41095

Log:
Apache Hudi 0.6.0 source release. Rename hudi-0.6.0 to 0.6.0

Added:
release/hudi/0.6.0/
  - copied from r41094, release/hudi/hudi-0.6.0/
Removed:
release/hudi/hudi-0.6.0/



[GitHub] [hudi] Trevor-zhang commented on pull request #2022: Code optimization on hudi-common module

2020-08-24 Thread GitBox


Trevor-zhang commented on pull request #2022:
URL: https://github.com/apache/hudi/pull/2022#issuecomment-679203797


   Hi @yanghua, can you take a look when you are free?







[GitHub] [hudi] Trevor-zhang opened a new pull request #2022: Code optimization on hudi-common module

2020-08-24 Thread GitBox


Trevor-zhang opened a new pull request #2022:
URL: https://github.com/apache/hudi/pull/2022


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Code optimization on hudi-common module*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1219) Code optimization on hudi-common module

2020-08-24 Thread Trevorzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trevorzhang updated HUDI-1219:
--
Summary: Code optimization on hudi-common module  (was: Code optimization)

> Code optimization on hudi-common module
> ---
>
> Key: HUDI-1219
> URL: https://issues.apache.org/jira/browse/HUDI-1219
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Trevorzhang
>Assignee: Trevorzhang
>Priority: Minor
> Fix For: 0.6.1
>
>
> Optimize code style and delete redundant code.




