[GitHub] [hudi] hudi-bot commented on pull request #3071: [HUDI-1976] Resolve Hive and Jackson vulnerability

2022-03-28 Thread GitBox


hudi-bot commented on pull request #3071:
URL: https://github.com/apache/hudi/pull/3071#issuecomment-1081434814


   
   ## CI report:

   * 9a8be2fd9d42d207314efa88f5315a435f1c917d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7028)
   * acad9f93025c93507ee22da2bfa505f7c80055f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7495)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3721) Metadata table blocks rollback and restore to savepoint before bootstrapped/init commit

2022-03-28 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513824#comment-17513824
 ] 

Ethan Guo commented on HUDI-3721:
-

Per offline discussion, a simpler approach would be to delete the MDT before 
restoring to a savepoint that predates the MDT bootstrap/init. This can be a manual 
step by the user or a step in the savepoint operation itself, as sketched below.
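
For reference, a minimal sketch of the manual step, assuming the metadata table lives under the standard .hoodie/metadata folder of the data table base path; the restore to the pre-MDT savepoint afterwards is whatever the user already runs (hudi-cli or the write client):

{code:java}
// Illustrative only: manually drop the metadata table before restoring to a
// savepoint that predates the MDT bootstrap/init commit.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DropMetadataTable {
  public static void main(String[] args) throws Exception {
    String basePath = args[0];                        // data table base path
    Path mdtPath = new Path(basePath, ".hoodie/metadata");
    FileSystem fs = mdtPath.getFileSystem(new Configuration());
    if (fs.exists(mdtPath)) {
      fs.delete(mdtPath, true);                       // recursive delete of the MDT
    }
    // After this, restore to the pre-MDT savepoint as usual; the MDT will be
    // re-bootstrapped on the next write if metadata is still enabled.
  }
}
{code}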

> Metadata table blocks rollback and restore to savepoint before 
> bootstrapped/init commit
> ---
>
> Key: HUDI-3721
> URL: https://issues.apache.org/jira/browse/HUDI-3721
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Say the table has instants C1 to C4, and the data table does not have MDT enabled.
> After C4, the writer enables MDT, so MDT has DC4 as its first instant, with the same
> timestamp as C4. The rollback of any commit before C4 on the data table now fails
> due to the following check. Restore to a savepoint before C4, if there is one, fails
> for the same reason. Yet the check itself can be relaxed in this case to allow the
> rollback to go through.
> {code:java}
> C1 C2 C3 C4
>          | metadata table init
>   // Case 2: The instant-to-rollback was never committed to Metadata Table. This can happen if the instant-to-rollback
>   // was a failed commit (never completed) as only completed instants are synced to Metadata Table.
>   // But the required Metadata Table instants should not have been archived
>   HoodieInstant syncedInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, instantToRollback);
>   if (metadataTableTimeline.getCommitsTimeline().isBeforeTimelineStarts(syncedInstant.getTimestamp())) {
>     throw new HoodieMetadataException(String.format("The instant %s required to sync rollback of %s has been archived",
>         syncedInstant, instantToRollback));
>   }
> {code}
> Prashant proposes making the bootstrap commit of the metadata table carry a
> specific suffix (just like the 001 and 002 suffixes for compaction and clean).
> This will make it trivial to detect such cases.
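
For illustration, a rough sketch (not the actual patch) of how the quoted check could be relaxed: Hudi instant timestamps are ordered strings, so an instant that predates the metadata table's first (bootstrap/init) instant can be let through; with the suffix proposed above, the bootstrap delta commit would instead be recognizable directly. Variable names reuse those from the snippet above.

{code:java}
// Sketch only: allow rollback of instants older than the MDT bootstrap instant.
Option<HoodieInstant> firstMdtInstant =
    metadataTableTimeline.getCommitsTimeline().firstInstant();
boolean rollbackPredatesMdt = firstMdtInstant.isPresent()
    && instantToRollback.compareTo(firstMdtInstant.get().getTimestamp()) < 0;

HoodieInstant syncedInstant =
    new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, instantToRollback);
if (!rollbackPredatesMdt
    && metadataTableTimeline.getCommitsTimeline()
        .isBeforeTimelineStarts(syncedInstant.getTimestamp())) {
  throw new HoodieMetadataException(String.format(
      "The instant %s required to sync rollback of %s has been archived",
      syncedInstant, instantToRollback));
}
{code}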



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3632) ensure Deltastreamer writes succeed if a target base path exists, but w/ no contents

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3632:
-
Issue Type: Improvement  (was: Task)

> ensure Deltastreamer writes succeed if a target base path exists, but w/ no 
> contents
> 
>
> Key: HUDI-3632
> URL: https://issues.apache.org/jira/browse/HUDI-3632
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.11.0
>
>
> If the target table dir exists but is empty, Deltastreamer throws a TableValidity
> exception; we might need to fix it (a rough sketch of the needed check follows below).
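
As a rough illustration of the check such a fix would need (the class and method names below are hypothetical, not Hudi API): treat a base path that exists but is completely empty the same as a non-existent one before running the table validity check.

{code:java}
// Hypothetical helper: decide whether an existing target base path should be
// treated as a fresh table because it has no contents yet.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TargetPathCheck {
  static boolean canInitializeFreshTable(String basePath) throws Exception {
    Path path = new Path(basePath);
    FileSystem fs = path.getFileSystem(new Configuration());
    if (!fs.exists(path)) {
      return true;                                 // path absent: safe to create
    }
    // path present: only safe to bootstrap if it is completely empty
    return fs.listStatus(path).length == 0;
  }
}
{code}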



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3632) ensure Deltastreamer writes succeed if a target base path exists, but w/ no contents

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3632:
-
Priority: Minor  (was: Major)

> ensure Deltastreamer writes succeed if a target base path exists, but w/ no 
> contents
> 
>
> Key: HUDI-3632
> URL: https://issues.apache.org/jira/browse/HUDI-3632
> Project: Apache Hudi
>  Issue Type: Task
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.11.0
>
>
> If the target table dir exists but is empty, Deltastreamer throws a TableValidity
> exception; we might need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3632) ensure Deltastreamer writes succeed if a target base path exists, but w/ no contents

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3632:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> ensure Deltastreamer writes succeed if a target base path exists, but w/ no 
> contents
> 
>
> Key: HUDI-3632
> URL: https://issues.apache.org/jira/browse/HUDI-3632
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.0
>
>
> If the target table dir exists but is empty, Deltastreamer throws a TableValidity
> exception; we might need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3616) Investigate mor async compact integ test failure

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3616:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Investigate mor async compact integ test failure
> 
>
> Key: HUDI-3616
> URL: https://issues.apache.org/jira/browse/HUDI-3616
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> mor async compact integ test validation is failing. 
>  
> {code:java}
> 22/03/14 01:31:28 WARN DagNode: Validation using data from input path /home/hadoop/staging/input//*/*
> 22/03/14 01:31:28 INFO ValidateDatasetNode: Validate data in target hudi path /home/hadoop/staging/output//*/*/*
> 22/03/14 01:31:31 ERROR DagNode: Data set validation failed. Total count in hudi 64400, input df count 64400
> 22/03/14 01:31:31 INFO DagScheduler: Forcing shutdown of executor service, this might kill running tasks
> 22/03/14 01:31:31 ERROR HoodieTestSuiteJob: Failed to run Test Suite 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: Hudi contents does not match contents input data. 
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.AssertionError: Hudi contents does not match contents input data. 
>   at org.apache.hudi.integ.testsuite.dag.nodes.BaseValidateDatasetNode.execute(BaseValidateDatasetNode.java:119)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Failed to run Test Suite 
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at 
> 

[jira] [Updated] (HUDI-3616) Investigate mor async compact integ test failure

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3616:
-
Priority: Critical  (was: Major)

> Investigate mor async compact integ test failure
> 
>
> Key: HUDI-3616
> URL: https://issues.apache.org/jira/browse/HUDI-3616
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0
>
>
> mor async compact integ test validation is failing. 
>  
> {code:java}
> 22/03/14 01:31:28 WARN DagNode: Validation using data from input path /home/hadoop/staging/input//*/*
> 22/03/14 01:31:28 INFO ValidateDatasetNode: Validate data in target hudi path /home/hadoop/staging/output//*/*/*
> 22/03/14 01:31:31 ERROR DagNode: Data set validation failed. Total count in hudi 64400, input df count 64400
> 22/03/14 01:31:31 INFO DagScheduler: Forcing shutdown of executor service, this might kill running tasks
> 22/03/14 01:31:31 ERROR HoodieTestSuiteJob: Failed to run Test Suite 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: Hudi contents does not match contents input data. 
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.AssertionError: Hudi contents does not match contents input data. 
>   at org.apache.hudi.integ.testsuite.dag.nodes.BaseValidateDatasetNode.execute(BaseValidateDatasetNode.java:119)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Failed to run Test Suite 
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208)
>   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at 
> 

[jira] [Updated] (HUDI-3609) Create scala version specific artifacts for hudi-spark-client

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3609:
-
Epic Link: HUDI-3679

> Create scala version specific artifacts for hudi-spark-client
> -
>
> Key: HUDI-3609
> URL: https://issues.apache.org/jira/browse/HUDI-3609
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.11.0
>
>
> Create Scala-version-specific artifacts for hudi-spark-client.
>  
> As of now, we generate just one artifact irrespective of the Scala or Spark version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3609) Create scala version specific artifacts for hudi-spark-client

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3609:
-
Priority: Blocker  (was: Critical)

> Create scala version specific artifacts for hudi-spark-client
> -
>
> Key: HUDI-3609
> URL: https://issues.apache.org/jira/browse/HUDI-3609
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Create Scala-version-specific artifacts for hudi-spark-client.
>  
> As of now, we generate just one artifact irrespective of the Scala or Spark version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3571) Add failure injection tests for spark datasource

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3571:
-
Priority: Major  (was: Blocker)

> Add failure injection tests for spark datasource
> 
>
> Key: HUDI-3571
> URL: https://issues.apache.org/jira/browse/HUDI-3571
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3571) Add failure injection tests for spark datasource

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3571:
-
Issue Type: Improvement  (was: Task)

> Add failure injection tests for spark datasource
> 
>
> Key: HUDI-3571
> URL: https://issues.apache.org/jira/browse/HUDI-3571
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3560) Add docker image for spark3 hadoop3 and hive3

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3560:
-
Reviewers: Alexey Kudinkin

> Add docker image for spark3 hadoop3 and hive3
> -
>
> Key: HUDI-3560
> URL: https://issues.apache.org/jira/browse/HUDI-3560
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3524) Decouple basic and advanced configs in website

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3524:


Assignee: Kyle Weller  (was: sivabalan narayanan)

> Decouple basic and advanced configs in website
> --
>
> Key: HUDI-3524
> URL: https://issues.apache.org/jira/browse/HUDI-3524
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Kyle Weller
>Priority: Critical
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Spark Datasource Configs
> Read configs:
> hoodie.datasource.query.type
> Write Options
> hoodie.datasource.write.operation
> hoodie.datasource.write.recordkey.field
> hoodie.datasource.write.precombine.field
> hoodie.datasource.write.partitionpath.field
> hoodie.datasource.write.keygenerator.class
> hoodie.datasource.write.payload.class
> hoodie.datasource.write.table.type
> hoodie.datasource.write.table.name
> hoodie.datasource.hive_sync.enable
> hoodie.datasource.write.hive_style_partitioning
> hoodie.datasource.hive_sync.mode
> hoodie.datasource.write.partitionpath.urlencode
> hoodie.datasource.hive_sync.partition_extractor_class
> hoodie.datasource.hive_sync.partition_fields
> Storage Config:
> hoodie.parquet.max.file.size
> hoodie.parquet.block.size
> HoodieMetadataConfig:
> hoodie.metadata.enable
> WriteConfigurations:
> hoodie.combine.before.upsert
> hoodie.write.markers.type
> hoodie.insert.shuffle.parallelism
> hoodie.rollback.parallelism
> hoodie.combine.before.delete
> hoodie.combine.before.insert
> hoodie.bulkinsert.shuffle.parallelism
> hoodie.delete.shuffle.parallelism
> hoodie.bulkinsert.sort.mode
> hoodie.embed.timeline.server
> hoodie.upsert.shuffle.parallelism
> hoodie.rollback.using.markers
> hoodie.finalize.write.parallelism
> Compaction config:
> hoodie.copyonwrite.record.size.estimate
> hoodie.compact.inline.max.delta.seconds
> hoodie.cleaner.policy
> hoodie.cleaner.commits.retained
> hoodie.clean.async
> hoodie.clean.automatic
> hoodie.commits.archival.batch
> hoodie.compact.inline
> hoodie.parquet.small.file.limit
> hoodie.compaction.strategy
> hoodie.archive.automatic
> hoodie.copyonwrite.insert.auto.split
> hoodie.compact.inline.max.delta.commits
> hoodie.keep.min.commits
> hoodie.cleaner.parallelism
> hoodie.record.size.estimation.threshold
> hoodie.compact.inline.trigger.strategy
> hoodie.keep.max.commits
> hoodie.copyonwrite.insert.split.size
> File System View Storage Configurations
> hoodie.filesystem.view.type
> hoodie.filesystem.view.secondary.type
> Index Configs:
> hoodie.index.type
> hoodie.index.bloom.fpp
> hoodie.index.bloom.num_entries
> hoodie.bloom.index.update.partition.path
> hoodie.bloom.index.use.caching
> hoodie.bloom.index.parallelism
> hoodie.bloom.index.prune.by.ranges
> hoodie.bloom.index.filter.type
> hoodie.simple.index.parallelism
> hoodie.simple.index.use.caching
> hoodie.global.simple.index.parallelism
> hoodie.simple.index.update.partition.path
> Common Configurations:
> hoodie.common.spillable.diskmap.type
> Metrics Configs:
> hoodie.metrics.reporter.type
> hoodie.metrics.on
> hoodie.metrics.reporter.class
> Record payload configs:
> hoodie.payload.ordering.field
> hoodie.payload.event.time.field
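
For context on how the basic subset above is typically exercised, a small illustrative write using only basic options from the list (Java form; the table name, field names, and path are placeholders, and df stands for any existing Dataset<Row>):

{code:java}
// Illustrative only: a minimal datasource write wired with "basic" options
// from the list above. Field names and target path are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class BasicWriteExample {
  public static void write(Dataset<Row> df) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.table.name", "my_table")
        .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .mode(SaveMode.Append)
        .save("s3://bucket/path/my_table");
  }
}
{code}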



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Component/s: spark-sql

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql, writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 
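
One workaround that may be worth testing (not a verified fix): the datasource write options include a URL-encoding switch for partition paths, which keeps the physical directory name ASCII. A hedged Java sketch of the reproducing write with it enabled, where df corresponds to res0 above:

{code:java}
// Illustrative only (not a verified fix): same write as the reproduction above,
// but with URL-encoded partition paths so the directory name stays ASCII.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class UrlEncodedPartitionWrite {
  public static void write(Dataset<Row> df) {
    df.write().format("hudi")
        .option("hoodie.table.name", "unicode_test")
        .option("hoodie.datasource.write.precombine.field", "_c0")
        .option("hoodie.datasource.write.recordkey.field", "_c0")
        .option("hoodie.datasource.write.partitionpath.field", "_c1")
        // assumption to verify: URL-encoding avoids the non-ASCII directory name
        .option("hoodie.datasource.write.partitionpath.urlencode", "true")
        .mode("append")
        .save("file:///Users/ji.qi/Desktop/unicode_test");
  }
}
{code}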

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Issue Type: Improvement  (was: Bug)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Assigned] (HUDI-3485) Add support for scheduler configs for async clustering w/ deltastreamer and spark streaming

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3485:


Assignee: Sagar Sumit  (was: sivabalan narayanan)

> Add support for scheduler configs for async clustering w/ deltastreamer and spark streaming
> ---
>
> Key: HUDI-3485
> URL: https://issues.apache.org/jira/browse/HUDI-3485
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> We have support for setting scheduler configs for compaction, but not for clustering.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3462) List of fixes to Metadata table after 0.10.1

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3462:
-
Fix Version/s: (was: 0.11.0)

> List of fixes to Metadata table after 0.10.1
> 
>
> Key: HUDI-3462
> URL: https://issues.apache.org/jira/browse/HUDI-3462
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3462) List of fixes to Metadata table after 0.10.1

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3462:
-
Priority: Blocker  (was: Critical)

> List of fixes to Metadata table after 0.10.1
> 
>
> Key: HUDI-3462
> URL: https://issues.apache.org/jira/browse/HUDI-3462
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3435) Do not throw exception when instant to rollback does not exist in metadata table active timeline

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3435.

Resolution: Fixed

> Do not throw exception when instant to rollback does not exist in metadata 
> table active timeline
> 
>
> Key: HUDI-3435
> URL: https://issues.apache.org/jira/browse/HUDI-3435
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> See the stacktrace:
> {code:xml}
> Caused by: org.apache.hudi.exception.HoodieMetadataException: The instant 
> [20220214211929120__deltacommit__COMPLETED] required to sync rollback of 
> 20220214211929120 has been archived
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$processRollbackMetadata$10(HoodieTableMetadataUtil.java:224)
>   at java.util.HashMap$Values.forEach(HashMap.java:982)
>   at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.processRollbackMetadata(HoodieTableMetadataUtil.java:201)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToRecords(HoodieTableMetadataUtil.java:178)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:653)
>   at 
> org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:244)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:122)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144)
>   at 
> org.apache.hudi.table.HoodieFlinkMergeOnReadTable.rollback(HoodieFlinkMergeOnReadTable.java:132)
>   at 
> org.apache.hudi.table.HoodieTable.rollbackInflightCompaction(HoodieTable.java:499)
>   at 
> org.apache.hudi.util.CompactionUtil.lambda$rollbackCompaction$1(CompactionUtil.java:163)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>   at 
> org.apache.hudi.util.CompactionUtil.rollbackCompaction(CompactionUtil.java:161)
>   at 
> org.apache.hudi.sink.compact.CompactionPlanOperator.open(CompactionPlanOperator.java:73)
>   at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:442)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3462) List of fixes to Metadata table after 0.10.1

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3462:
-
Priority: Major  (was: Blocker)

> List of fixes to Metadata table after 0.10.1
> 
>
> Key: HUDI-3462
> URL: https://issues.apache.org/jira/browse/HUDI-3462
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3436) 0.11.0/0.10.2 release notes prep ticket

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3436.

Resolution: Abandoned

> 0.11.0/0.10.2 release notes prep ticket
> ---
>
> Key: HUDI-3436
> URL: https://issues.apache.org/jira/browse/HUDI-3436
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>
> https://issues.apache.org/jira/browse/HUDI-3402
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3425:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Clean up spill path created by Hudi during uneventful shutdown
> --
>
> Key: HUDI-3425
> URL: https://issues.apache.org/jira/browse/HUDI-3425
> Project: Apache Hudi
>  Issue Type: Task
>  Components: compaction
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0
>
>
> h1. Hudi spill path not getting cleared when containers are killed abruptly
>  
> When YARN kills the containers abruptly for any reason while a Hudi stage is in progress, the spill path created by Hudi on disk is not cleaned up, and as a result the nodes in the cluster start running out of space. We need to clear the spill path manually to free up disk space.
>  
> Ref issue: https://github.com/apache/hudi/issues/4771
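
A minimal sketch of the kind of cleanup being asked for, assuming the directory to clear is the spillable-map path Hudi writes to (e.g. the location configured via hoodie.memory.spillable.map.path). A shutdown hook only covers graceful stops; abrupt YARN kills would still need an external sweep of the same path:

{code:java}
// Sketch only: best-effort cleanup of the spill directory on JVM shutdown.
// Hard kills bypass shutdown hooks, so a periodic external sweep of the same
// path is still needed for the abrupt-kill case described above.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class SpillPathCleaner {
  public static void registerCleanup(String spillDir) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try (Stream<Path> walk = Files.walk(Paths.get(spillDir))) {
        walk.sorted(Comparator.reverseOrder())     // delete children before parents
            .forEach(p -> p.toFile().delete());
      } catch (IOException e) {
        // best effort; nothing else to do during shutdown
      }
    }));
  }
}
{code}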



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3425:
-
Issue Type: Improvement  (was: Task)

> Clean up spill path created by Hudi during uneventful shutdown
> --
>
> Key: HUDI-3425
> URL: https://issues.apache.org/jira/browse/HUDI-3425
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> h1. Hudi spill path not getting cleared when containers are killed abruptly
>  
> When YARN kills the containers abruptly for any reason while a Hudi stage is in progress, the spill path created by Hudi on disk is not cleaned up, and as a result the nodes in the cluster start running out of space. We need to clear the spill path manually to free up disk space.
>  
> Ref issue: https://github.com/apache/hudi/issues/4771



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3436) 0.11.0/0.10.2 release notes prep ticket

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3436:
-
Fix Version/s: (was: 0.11.0)

> 0.11.0/0.10.2 release notes prep ticket
> ---
>
> Key: HUDI-3436
> URL: https://issues.apache.org/jira/browse/HUDI-3436
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>
> https://issues.apache.org/jira/browse/HUDI-3402
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3425) Clean up spill path created by Hudi during uneventful shutdown

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3425:
-
Priority: Major  (was: Critical)

> Clean up spill path created by Hudi during uneventful shutdown
> --
>
> Key: HUDI-3425
> URL: https://issues.apache.org/jira/browse/HUDI-3425
> Project: Apache Hudi
>  Issue Type: Task
>  Components: compaction
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> h1. Hudi spill path not getting cleared when containers are killed abruptly
>  
> When YARN kills the containers abruptly for any reason while a Hudi stage is in progress, the spill path created by Hudi on disk is not cleaned up, and as a result the nodes in the cluster start running out of space. We need to clear the spill path manually to free up disk space.
>  
> Ref issue: https://github.com/apache/hudi/issues/4771



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3387) Enable async timeline server by default

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3387.

Fix Version/s: (was: 0.11.0)
   Resolution: Duplicate

> Enable async timeline server by default
> ---
>
> Key: HUDI-3387
> URL: https://issues.apache.org/jira/browse/HUDI-3387
> Project: Apache Hudi
>  Issue Type: Task
>  Components: timeline-server
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> enable async timeline server by default 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3340) Fix deploy_staging_jars for diff spark versions

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3340:
-
Priority: Minor  (was: Critical)

> Fix deploy_staging_jars for diff spark versions
> ---
>
> Key: HUDI-3340
> URL: https://issues.apache.org/jira/browse/HUDI-3340
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dev-experience
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call
> Fix For: 0.11.0
>
>
> From 0.10.1 onwards, we use explicit Spark versions while compiling. This script
> (script/release/deploy_staging_jars.sh) needs some fixes around setting the right Spark version.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3340) Fix deploy_staging_jars for diff spark versions

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3340:


Assignee: Raymond Xu  (was: sivabalan narayanan)

> Fix deploy_staging_jars for diff spark versions
> ---
>
> Key: HUDI-3340
> URL: https://issues.apache.org/jira/browse/HUDI-3340
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dev-experience
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Minor
>  Labels: hudi-on-call
> Fix For: 0.11.0
>
>
> From 0.10.1 onwards, we use explicit Spark versions while compiling. This script
> (script/release/deploy_staging_jars.sh) needs some fixes around setting the right Spark version.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3291) Flip default record payload to DefaultHoodieRecordPayload

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3291:
-
Epic Link: HUDI-3217

> Flip default record payload to DefaultHoodieRecordPayload
> 
>
> Key: HUDI-3291
> URL: https://issues.apache.org/jira/browse/HUDI-3291
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3291) Flip default record payload to DefaultHoodieRecordPayload

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3291:
-
Issue Type: Improvement  (was: Task)

> Flip default record payload to DefaultHoodieRecordPayload
> 
>
> Key: HUDI-3291
> URL: https://issues.apache.org/jira/browse/HUDI-3291
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3291) Flip default record payload to DefaultHoodieRecordPayload

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3291:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Flip default record payload to DefaultHoodieRecordPayload
> 
>
> Key: HUDI-3291
> URL: https://issues.apache.org/jira/browse/HUDI-3291
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3242.

Resolution: Information Provided

Resolved for user

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files after the initial commit of the table.
> But if the second partition is given as input for the first commit, all of its files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf 

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3242:
-
Fix Version/s: (was: 0.12.0)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, I see only partial discovery of files
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf 

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3242:
-
Fix Version/s: 0.12.0

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, I see only partial discovery of files
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf 

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3242:
-
Fix Version/s: (was: 0.11.0)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, I see only partial discovery of files
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf 

[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3242:
-
Priority: Minor  (was: Major)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table, I see only partial discovery of files
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars 
> s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
> s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar
>  \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class 
> org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf 
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
>  \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 
> \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
>  \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
>  \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02
>  \
> --hoodie-conf 
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
>  \
> --hoodie-conf 

[GitHub] [hudi] codope commented on a change in pull request #5043: [HUDI-3485] Adding scheduler pool configs for async clustering

2022-03-28 Thread GitBox


codope commented on a change in pull request #5043:
URL: https://github.com/apache/hudi/pull/5043#discussion_r837068928



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
##
@@ -388,6 +388,14 @@ private boolean onDeltaSyncShutdown(boolean error) {
 @Parameter(names = {"--retry-last-pending-inline-clustering", "-rc"}, 
description = "Retry last pending inline clustering plan before writing to 
sink.")
 public Boolean retryLastPendingInlineClusteringJob = false;
 
+@Parameter(names = {"--cluster-scheduling-weight"}, description = "Scheduling weight for clustering as defined in "
++ "https://spark.apache.org/docs/latest/job-scheduling.html")
+public Integer clusterSchedulingWeight = 1;

Review comment:
   Just a thought, does it make sense to determine the clustering and 
compaction weight depending on their commit frequency ratio?
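   The weight here feeds Spark's fair-scheduler pools. A minimal, illustrative Java sketch of that mechanism follows; the pool names, weights, and allocation-file path are assumptions for illustration, not the actual DeltaStreamer wiring.
   
   {code:java}
   import org.apache.spark.SparkConf;
   import org.apache.spark.api.java.JavaSparkContext;
   
   public class SchedulingWeightSketch {
     public static void main(String[] args) {
       // Enable Spark's FAIR scheduler; per-pool settings (including "weight")
       // normally live in a fairscheduler.xml allocation file.
       SparkConf conf = new SparkConf()
           .setAppName("scheduling-weight-sketch")
           .setMaster("local[2]")
           .set("spark.scheduler.mode", "FAIR")
           // Hypothetical allocation file declaring e.g. a "hoodiecluster" pool
           // with weight 1 and a "hoodiecompact" pool with weight 2.
           .set("spark.scheduler.allocation.file", "/tmp/fairscheduler.xml");
   
       try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
         // Jobs submitted from a thread inherit the pool set as a local property,
         // so clustering work can be routed to its own weighted pool.
         jsc.setLocalProperty("spark.scheduler.pool", "hoodiecluster");
         // ... submit clustering jobs here; a pool with weight 2 would get
         // roughly twice the resource share of a pool with weight 1.
       }
     }
   }
   {code}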




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3216) Support timestamp with microseconds precision

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3216:
-
Priority: Critical  (was: Major)

> Support timestamp with microseconds precision
> -
>
> Key: HUDI-3216
> URL: https://issues.apache.org/jira/browse/HUDI-3216
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.12.0
>
>
> As of now, if a field with a timestamp datatype with microsecond precision is
> ingested into Hudi, the resultant dataset will only retain millisecond
> granularity.
> Ref issue: [https://github.com/apache/hudi/issues/3429]
> 
> We might need to support microsecond granularity.
> The referenced issue has some pointers on how to go about it.
>  
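To make the precision loss concrete, here is a small, self-contained Java sketch (not Hudi code) showing how a microsecond-precision timestamp loses its sub-millisecond component once stored at millisecond granularity.

{code:java}
import java.time.Instant;
import java.util.concurrent.TimeUnit;

public class TimestampPrecisionSketch {
  public static void main(String[] args) {
    // A timestamp with microsecond precision: 2021-08-05T10:15:30.123456Z
    long micros = 1628158530123456L;

    // What a millisecond-granularity column keeps
    long millis = TimeUnit.MICROSECONDS.toMillis(micros); // 1628158530123

    // Converting back to microseconds shows what was lost
    long roundTripped = TimeUnit.MILLISECONDS.toMicros(millis); // 1628158530123000

    System.out.println("original micros   = " + micros);
    System.out.println("round-tripped     = " + roundTripped);
    System.out.println("lost microseconds = " + (micros - roundTripped)); // 456
    System.out.println(Instant.ofEpochMilli(millis)); // sub-millisecond part gone
  }
}
{code}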



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3216) Support timestamp with microseconds precision

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3216:
-
Issue Type: Improvement  (was: Task)

> Support timestamp with microseconds precision
> -
>
> Key: HUDI-3216
> URL: https://issues.apache.org/jira/browse/HUDI-3216
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>
> As of now, if a field with a timestamp datatype with microsecond precision is
> ingested into Hudi, the resultant dataset will only retain millisecond
> granularity.
> Ref issue: [https://github.com/apache/hudi/issues/3429]
> 
> We might need to support microsecond granularity.
> The referenced issue has some pointers on how to go about it.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3216) Support timestamp with microseconds precision

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3216:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Support timestamp with microseconds precision
> -
>
> Key: HUDI-3216
> URL: https://issues.apache.org/jira/browse/HUDI-3216
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.0
>
>
> As of now, if a field with a timestamp datatype with microsecond precision is
> ingested into Hudi, the resultant dataset will only retain millisecond
> granularity.
> Ref issue: [https://github.com/apache/hudi/issues/3429]
> 
> We might need to support microsecond granularity.
> The referenced issue has some pointers on how to go about it.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3068:
-
Priority: Blocker  (was: Major)

> Add support to sync all partitions in hive sync tool
> 
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: Harshal Patil
>Priority: Blocker
>  Labels: pull-request-available, sev:critical
> Fix For: 0.12.0
>
>
> If a user runs hive sync occasionally, and if archival kicked in and trimmed
> some commits, and if there were partitions added during those commits which
> were never updated later, hive sync will miss those partitions.
> {code:java}
>   LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", 
> Getting commits since then");
>   return 
> TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
>   .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
> } {code}
> That is because, for recurrent syncs, we always fetch new commits from the
> timeline after the last synced instant, read their commit metadata, and only
> pick up the partitions added as part of them.
>  
> We can add a new config to hive sync tool to override this behavior. 
> --sync-all-partitions 
> When this config is set to true, we should ignore the last synced instant and
> go down the route below, which is used when syncing for the first time.
>  
> {code:java}
> if (!lastCommitTimeSynced.isPresent()) {
>   LOG.info("Last commit time synced is not known, listing all partitions in " 
> + basePath + ",FS :" + fs);
>   HoodieLocalEngineContext engineContext = new 
> HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, 
> useFileListingFromMetadata, assumeDatePartitioning);
> } {code}
>  
>  
> Ref issue: 
> https://github.com/apache/hudi/issues/3890
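A minimal sketch of how the two routes could be combined behind the proposed flag, assuming a hypothetical syncAllPartitions field backing --sync-all-partitions and reusing the fields from the snippets above; this is illustrative only, not the actual HiveSyncTool code.

{code:java}
// Illustrative only: combines the two quoted paths behind a hypothetical flag.
private List<String> getPartitionsToSync(Option<String> lastCommitTimeSynced) {
  if (syncAllPartitions || !lastCommitTimeSynced.isPresent()) {
    // Full listing, same route as the first-time sync today.
    LOG.info("Listing all partitions in " + basePath + ", FS: " + fs);
    HoodieLocalEngineContext engineContext =
        new HoodieLocalEngineContext(metaClient.getHadoopConf());
    return FSUtils.getAllPartitionPaths(engineContext, basePath,
        useFileListingFromMetadata, assumeDatePartitioning);
  }
  // Incremental route: only partitions touched by commits after the last sync,
  // which can miss partitions once those commits are archived.
  LOG.info("Last commit time synced is " + lastCommitTimeSynced.get()
      + ", getting commits since then");
  return TimelineUtils.getPartitionsWritten(
      metaClient.getActiveTimeline().getCommitsTimeline()
          .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
}
{code}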



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3068:
-
Component/s: meta-sync
 (was: hive)

> Add support to sync all partitions in hive sync tool
> 
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: Harshal Patil
>Priority: Major
>  Labels: pull-request-available, sev:critical
> Fix For: 0.12.0
>
>
> If a user runs hive sync occasionally, and if archival kicked in and trimmed
> some commits, and if there were partitions added during those commits which
> were never updated later, hive sync will miss those partitions.
> {code:java}
>   LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", 
> Getting commits since then");
>   return 
> TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
>   .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
> } {code}
> That is because, for recurrent syncs, we always fetch new commits from the
> timeline after the last synced instant, read their commit metadata, and only
> pick up the partitions added as part of them.
>  
> We can add a new config to hive sync tool to override this behavior. 
> --sync-all-partitions 
> When this config is set to true, we should ignore the last synced instant and
> go down the route below, which is used when syncing for the first time.
>  
> {code:java}
> if (!lastCommitTimeSynced.isPresent()) {
>   LOG.info("Last commit time synced is not known, listing all partitions in " 
> + basePath + ",FS :" + fs);
>   HoodieLocalEngineContext engineContext = new 
> HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, 
> useFileListingFromMetadata, assumeDatePartitioning);
> } {code}
>  
>  
> Ref issue: 
> https://github.com/apache/hudi/issues/3890



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3068:
-
Issue Type: New Feature  (was: Improvement)

> Add support to sync all partitions in hive sync tool
> 
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive
>Reporter: sivabalan narayanan
>Assignee: Harshal Patil
>Priority: Major
>  Labels: pull-request-available, sev:critical
> Fix For: 0.12.0
>
>
> If a user runs hive sync occasionally, and if archival kicked in and trimmed
> some commits, and if there were partitions added during those commits which
> were never updated later, hive sync will miss those partitions.
> {code:java}
>   LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", 
> Getting commits since then");
>   return 
> TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
>   .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
> } {code}
> That is because, for recurrent syncs, we always fetch new commits from the
> timeline after the last synced instant, read their commit metadata, and only
> pick up the partitions added as part of them.
>  
> We can add a new config to hive sync tool to override this behavior. 
> --sync-all-partitions 
> When this config is set to true, we should ignore the last synced instant and
> go down the route below, which is used when syncing for the first time.
>  
> {code:java}
> if (!lastCommitTimeSynced.isPresent()) {
>   LOG.info("Last commit time synced is not known, listing all partitions in " 
> + basePath + ",FS :" + fs);
>   HoodieLocalEngineContext engineContext = new 
> HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, 
> useFileListingFromMetadata, assumeDatePartitioning);
> } {code}
>  
>  
> Ref issue: 
> https://github.com/apache/hudi/issues/3890



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3068:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Add support to sync all partitions in hive sync tool
> 
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: hive
>Reporter: sivabalan narayanan
>Assignee: Harshal Patil
>Priority: Major
>  Labels: pull-request-available, sev:critical
> Fix For: 0.12.0
>
>
> If a user runs hive sync occasionally, and if archival kicked in and trimmed
> some commits, and if there were partitions added during those commits which
> were never updated later, hive sync will miss those partitions.
> {code:java}
>   LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", 
> Getting commits since then");
>   return 
> TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
>   .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
> } {code}
> That is because, for recurrent syncs, we always fetch new commits from the
> timeline after the last synced instant, read their commit metadata, and only
> pick up the partitions added as part of them.
>  
> We can add a new config to hive sync tool to override this behavior. 
> --sync-all-partitions 
> When this config is set to true, we should ignore the last synced instant and
> go down the route below, which is used when syncing for the first time.
>  
> {code:java}
> if (!lastCommitTimeSynced.isPresent()) {
>   LOG.info("Last commit time synced is not known, listing all partitions in " 
> + basePath + ",FS :" + fs);
>   HoodieLocalEngineContext engineContext = new 
> HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, 
> useFileListingFromMetadata, assumeDatePartitioning);
> } {code}
>  
>  
> Ref issue: 
> https://github.com/apache/hudi/issues/3890



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3062) savepoint rollback of last but one savepoint fails

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3062:
-
Priority: Blocker  (was: Critical)

> savepoint rollback of last but one savepoint fails
> --
>
> Key: HUDI-3062
> URL: https://issues.apache.org/jira/browse/HUDI-3062
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Priority: Blocker
>  Labels: sev:critical
> Fix For: 0.11.0
>
>
> So, I created 2 savepoints as below:
> c1, c2, c3, sp1, c4, sp2, c5.
> I tried a savepoint rollback for sp2 and it worked, but it left trailing
> rollback meta files.
> I then tried a savepoint rollback for sp1 and it failed. The stacktrace does
> not have sufficient info.
> {code:java}
> 21/12/18 06:20:00 INFO HoodieActiveTimeline: Loaded instants upto : 
> Option{val=[==>20211218061954430__rollback__REQUESTED]}
> 21/12/18 06:20:00 INFO BaseRollbackPlanActionExecutor: Requesting Rollback 
> with instant time [==>20211218061954430__rollback__REQUESTED]
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 66
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 26
> 21/12/18 06:20:00 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 
> 192.168.1.4:54359 in memory (size: 25.5 KB, free: 366.2 MB)
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 110
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 99
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 47
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 21
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 43
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 55
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 104
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 124
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 29
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 91
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 123
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 120
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 25
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 32
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 92
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 76
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 89
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 102
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 50
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 49
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 116
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 96
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 118
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 44
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 60
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 87
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 77
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 75
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 9
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 72
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 2
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 37
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 113
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 67
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 28
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 95
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 59
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 68
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 45
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 39
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 74
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 20
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 90
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 56
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 58
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 61
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 13
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 46
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 101
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 105
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 81
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 63
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 78
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 4
> 21/12/18 06:20:00 INFO ContextCleaner: Cleaned accumulator 31
> 

[jira] [Updated] (HUDI-3054) Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3054:
-
Issue Type: Test  (was: Task)

> Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter
> ---
>
> Key: HUDI-3054
> URL: https://issues.apache.org/jira/browse/HUDI-3054
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/4428/logs/21]
>  
> {code:java}
> 2021-12-17T11:39:57.1645757Z [INFO] Running 
> org.apache.hudi.client.TestHoodieClientMultiWriter
> 2021-12-17T11:39:57.3453991Z 339506 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.3984328Z 339559 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.5278608Z 339689 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.9783107Z 340139 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.9927490Z 340154 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:40:10.1428665Z 352304 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:10.9930023Z 353149 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:11.4294603Z 353590 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:11.4763085Z 353637 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:11.6014876Z 353762 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:12.0892513Z 354250 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:12.1061317Z 354267 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:23.1499732Z 365311 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:24.1626167Z 366323 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.1945944Z 366355 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.3084730Z 366469 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.7350862Z 366896 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.7482727Z 366909 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:43.1530857Z 385314 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:44.0641298Z 386225 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 

[jira] [Updated] (HUDI-3054) Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3054:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter
> ---
>
> Key: HUDI-3054
> URL: https://issues.apache.org/jira/browse/HUDI-3054
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/4428/logs/21]
>  
> {code:java}
> 2021-12-17T11:39:57.1645757Z [INFO] Running 
> org.apache.hudi.client.TestHoodieClientMultiWriter
> 2021-12-17T11:39:57.3453991Z 339506 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.3984328Z 339559 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.5278608Z 339689 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.9783107Z 340139 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.9927490Z 340154 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:40:10.1428665Z 352304 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:10.9930023Z 353149 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:11.4294603Z 353590 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:11.4763085Z 353637 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:11.6014876Z 353762 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:12.0892513Z 354250 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:12.1061317Z 354267 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:23.1499732Z 365311 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:24.1626167Z 366323 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.1945944Z 366355 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.3084730Z 366469 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.7350862Z 366896 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.7482727Z 366909 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:43.1530857Z 385314 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:44.0641298Z 386225 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in 

[jira] [Updated] (HUDI-2866) Get Metadata table bootstrapping in Flink in parity with spark

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2866:
-
Component/s: flink
 metadata

> Get Metadata table bootstrapping in Flink in parity with spark
> --
>
> Key: HUDI-2866
> URL: https://issues.apache.org/jira/browse/HUDI-2866
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink, metadata
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Please revisit the Flink code and ensure we have parity with Spark,
> especially around metadata table bootstrap. I don't see concurrency support
> in Flink, so these changes are not as mandatory as they are in Spark, but it
> is nevertheless good to be in sync.
>  
> Related tickets:
> [https://github.com/apache/hudi/pull/4114]
> [https://github.com/apache/hudi/pull/4124]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2866) Get Metadata table bootstrapping in Flink in parity with spark

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2866:
-
Issue Type: Improvement  (was: New Feature)

> Get Metadata table bootstrapping in Flink in parity with spark
> --
>
> Key: HUDI-2866
> URL: https://issues.apache.org/jira/browse/HUDI-2866
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Please revisit the Flink code and ensure we have parity with Spark,
> especially around metadata table bootstrap. I don't see concurrency support
> in Flink, so these changes are not as mandatory as they are in Spark, but it
> is nevertheless good to be in sync.
>  
> Related tickets:
> [https://github.com/apache/hudi/pull/4114]
> [https://github.com/apache/hudi/pull/4124]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2866) Get Metadata table bootstrapping in Flink in parity with spark

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2866:
-
Issue Type: New Feature  (was: Task)

> Get Metadata table bootstrapping in Flink in parity with spark
> --
>
> Key: HUDI-2866
> URL: https://issues.apache.org/jira/browse/HUDI-2866
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Please revisit the Flink code and ensure we have parity with Spark,
> especially around metadata table bootstrap. I don't see concurrency support
> in Flink, so these changes are not as mandatory as they are in Spark, but it
> is nevertheless good to be in sync.
>  
> Related tickets:
> [https://github.com/apache/hudi/pull/4114]
> [https://github.com/apache/hudi/pull/4124]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2782:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Fix marker based strategy for structured streaming
> --
>
> Key: HUDI-2782
> URL: https://issues.apache.org/jira/browse/HUDI-2782
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are
> making the timeline-server-based marker type the default. But we have an issue
> with structured streaming: after the first micro-batch, the timeline
> server gets shut down, and for subsequent micro-batches the timeline server is
> not available. So, in the patch, the marker type is overridden just for
> structured streaming.
>  
> We may want to revisit this and see how to go about it. 
>  
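For reference, a minimal sketch of pinning the marker type on a writer, assuming the hoodie.write.markers.type key with DIRECT / TIMELINE_SERVER_BASED values as in recent releases; the table name below is a placeholder.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class MarkerTypeOverrideSketch {
  public static void main(String[] args) {
    // Options passed to the Hudi writer; for structured streaming the patch
    // effectively forces direct markers instead of the timeline-server default.
    Map<String, String> hudiOptions = new HashMap<>();
    hudiOptions.put("hoodie.table.name", "my_table");        // placeholder table name
    hudiOptions.put("hoodie.write.markers.type", "DIRECT");  // avoid relying on the
                                                             // embedded timeline server
    hudiOptions.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
{code}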



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2768) Enable async timeline server by default

2022-03-28 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513805#comment-17513805
 ] 

Raymond Xu commented on HUDI-2768:
--

[https://github.com/apache/hudi/pull/4807]
WIP PR

 

> Enable async timeline server by default
> ---
>
> Key: HUDI-2768
> URL: https://issues.apache.org/jira/browse/HUDI-2768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server, writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.12.0
>
>
> Enable async timeline server by default.
>  
> [https://github.com/apache/hudi/pull/3949]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2768) Enable async timeline server by default

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2768:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Enable async timeline server by default
> ---
>
> Key: HUDI-2768
> URL: https://issues.apache.org/jira/browse/HUDI-2768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server, writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.12.0
>
>
> Enable async timeline server by default.
>  
> [https://github.com/apache/hudi/pull/3949]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1456) [UMBRELLA] Concurrency Control for Hudi writers and table services

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1456:
-
Priority: Blocker  (was: Major)

> [UMBRELLA] Concurrency Control for Hudi writers and table services
> --
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.12.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for
> Hudi tables. This work will be done in multiple phases.
>  # Support for parallel writing to Hudi tables -> This feature will allow users
> to have multiple writers mutate the tables, without allowing concurrent
> updates to the same file.
>  # Concurrency control at file/record level -> This feature will allow users
> to have multiple writers mutate the tables with the ability to ensure
> serializability at the record level.
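For phase 1, the writer-side knobs look roughly like the following sketch; the key names, the ZooKeeper lock-provider class, and the endpoint values are assumptions to verify against the release in use.

{code:java}
import java.util.Properties;

public class OccConfigSketch {
  public static void main(String[] args) {
    // Illustrative writer properties for optimistic concurrency control.
    Properties props = new Properties();
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
    props.setProperty("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
    props.setProperty("hoodie.write.lock.zookeeper.url", "zk-host:2181");      // assumed endpoint
    props.setProperty("hoodie.write.lock.zookeeper.base_path", "/hudi-locks"); // assumed path
    props.list(System.out);
  }
}
{code}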



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1456) [UMBRELLA] Concurrency Control for Hudi writers and table services

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1456:
-
Fix Version/s: 0.12.0

> [UMBRELLA] Concurrency Control for Hudi writers and table services
> --
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.12.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for
> Hudi tables. This work will be done in multiple phases.
>  # Support for parallel writing to Hudi tables -> This feature will allow users
> to have multiple writers mutate the tables, without allowing concurrent
> updates to the same file.
>  # Concurrency control at file/record level -> This feature will allow users
> to have multiple writers mutate the tables with the ability to ensure
> serializability at the record level.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2635) Fix double locking issue with multi-writers with proper abstraction around trnx manager

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2635:
-
Issue Type: Improvement  (was: Task)

> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager
> ---
>
> Key: HUDI-2635
> URL: https://issues.apache.org/jira/browse/HUDI-2635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.11.0
>
>
> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2613:
-
Issue Type: Improvement  (was: Task)

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2635) Fix double locking issue with multi-writers with proper abstraction around trnx manager

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2635:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager
> ---
>
> Key: HUDI-2635
> URL: https://issues.apache.org/jira/browse/HUDI-2635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2635) Fix double locking issue with multi-writers with proper abstraction around trnx manager

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2635:
-
Component/s: multi-writer

> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager
> ---
>
> Key: HUDI-2635
> URL: https://issues.apache.org/jira/browse/HUDI-2635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> Fix double locking issue with multi-writers with proper abstraction around 
> trnx manager



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #5164: [HUDI-3741] Fix flink bucket index bulk insert generates too many sma…

2022-03-28 Thread GitBox


hudi-bot removed a comment on pull request #5164:
URL: https://github.com/apache/hudi/pull/5164#issuecomment-1081411848


   
   ## CI report:
   
   * d7552a06e27b4ecb13d6fd290a48bad1cfddb58f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2613:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2559) Ensure unique timestamps are generated for commit times with concurrent writers

2022-03-28 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513803#comment-17513803
 ] 

Raymond Xu commented on HUDI-2559:
--

We need to eliminate the issue by using locks or by giving each writer an 
identifier in the multi-writer scenario. The last patch was a mitigation; more 
work to continue here.

> Ensure unique timestamps are generated for commit times with concurrent 
> writers
> ---
>
> Key: HUDI-2559
> URL: https://issues.apache.org/jira/browse/HUDI-2559
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 0.11.0
>
>
> Ensure unique timestamps are generated for commit times with concurrent 
> writers.
> This is the piece of code in HoodieActiveTimeline which creates a new commit 
> time.
> {code:java}
> public static String createNewInstantTime(long milliseconds) {
>   return lastInstantTime.updateAndGet((oldVal) -> {
> String newCommitTime;
> do {
>   newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(new 
> Date(System.currentTimeMillis() + milliseconds));
> } while (HoodieTimeline.compareTimestamps(newCommitTime, 
> LESSER_THAN_OR_EQUALS, oldVal));
> return newCommitTime;
>   });
> }
> {code}
> There are chances that a deltastreamer and a concurrent spark ds writer get 
> the same timestamp and one of them fails. 
> Related issues and github jiras: 
> [https://github.com/apache/hudi/issues/3782]
> https://issues.apache.org/jira/browse/HUDI-2549
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #5159: [HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName`

2022-03-28 Thread GitBox


hudi-bot commented on pull request #5159:
URL: https://github.com/apache/hudi/pull/5159#issuecomment-1081413544


   
   ## CI report:
   
   * f9075077ff6d7b14bfaebe9d62b10b141ee9738c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7478)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7480)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7486)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2559) Ensure unique timestamps are generated for commit times with concurrent writers

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2559:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Ensure unique timestamps are generated for commit times with concurrent 
> writers
> ---
>
> Key: HUDI-2559
> URL: https://issues.apache.org/jira/browse/HUDI-2559
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 0.12.0
>
>
> Ensure unique timestamps are generated for commit times with concurrent 
> writers.
> This is the piece of code in HoodieActiveTimeline which creates a new commit 
> time.
> {code:java}
> public static String createNewInstantTime(long milliseconds) {
>   return lastInstantTime.updateAndGet((oldVal) -> {
> String newCommitTime;
> do {
>   newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(new 
> Date(System.currentTimeMillis() + milliseconds));
> } while (HoodieTimeline.compareTimestamps(newCommitTime, 
> LESSER_THAN_OR_EQUALS, oldVal));
> return newCommitTime;
>   });
> }
> {code}
> There are chances that a deltastreamer and a concurrent spark ds writer get 
> the same timestamp and one of them fails. 
> Related issues and github jiras: 
> [https://github.com/apache/hudi/issues/3782]
> https://issues.apache.org/jira/browse/HUDI-2549
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2613:
-
Component/s: code-quality

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Task
>  Components: code-quality
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #5164: [HUDI-3741] Fix flink bucket index bulk insert generates too many sma…

2022-03-28 Thread GitBox


hudi-bot commented on pull request #5164:
URL: https://github.com/apache/hudi/pull/5164#issuecomment-1081413569


   
   ## CI report:
   
   * d7552a06e27b4ecb13d6fd290a48bad1cfddb58f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7494)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #5159: [HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName`

2022-03-28 Thread GitBox


hudi-bot removed a comment on pull request #5159:
URL: https://github.com/apache/hudi/pull/5159#issuecomment-1081318425


   
   ## CI report:
   
   * f9075077ff6d7b14bfaebe9d62b10b141ee9738c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7478)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7480)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7486)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2559) Ensure unique timestamps are generated for commit times with concurrent writers

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2559:
-
Issue Type: Improvement  (was: Task)

> Ensure unique timestamps are generated for commit times with concurrent 
> writers
> ---
>
> Key: HUDI-2559
> URL: https://issues.apache.org/jira/browse/HUDI-2559
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 0.11.0
>
>
> Ensure unique timestamps are generated for commit times with concurrent 
> writers.
> This is the piece of code in HoodieActiveTimeline which creates a new commit 
> time.
> {code:java}
> public static String createNewInstantTime(long milliseconds) {
>   return lastInstantTime.updateAndGet((oldVal) -> {
> String newCommitTime;
> do {
>   newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(new 
> Date(System.currentTimeMillis() + milliseconds));
> } while (HoodieTimeline.compareTimestamps(newCommitTime, 
> LESSER_THAN_OR_EQUALS, oldVal));
> return newCommitTime;
>   });
> }
> {code}
> There are chances that a deltastreamer and a concurrent spark ds writer get 
> the same timestamp and one of them fails. 
> Related issues and github jiras: 
> [https://github.com/apache/hudi/issues/3782]
> https://issues.apache.org/jira/browse/HUDI-2549
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2473) Fix compaction action type in commit metadata

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2473:
-
Priority: Blocker  (was: Major)

> Fix compaction action type in commit metadata
> -
>
> Key: HUDI-2473
> URL: https://issues.apache.org/jira/browse/HUDI-2473
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Fix compaction action type in commit metadata.
> As of now, it is empty for the compaction commit. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #5164: [HUDI-3741] Fix flink bucket index bulk insert generates too many sma…

2022-03-28 Thread GitBox


hudi-bot commented on pull request #5164:
URL: https://github.com/apache/hudi/pull/5164#issuecomment-1081411848


   
   ## CI report:
   
   * d7552a06e27b4ecb13d6fd290a48bad1cfddb58f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2466) Add and validate comprehensive yamls for spark dml

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2466:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Add and validate comprehensive yamls for spark dml 
> ---
>
> Key: HUDI-2466
> URL: https://issues.apache.org/jira/browse/HUDI-2466
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Once merge and update are supported, we need to add and validate comprehensive 
> yamls for spark dml using the test suite infra.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2466) Add and validate comprehensive yamls for spark dml

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2466:
-
Issue Type: Test  (was: Task)

> Add and validate comprehensive yamls for spark dml 
> ---
>
> Key: HUDI-2466
> URL: https://issues.apache.org/jira/browse/HUDI-2466
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Once merge and update are supported, we need to add and validate comprehensive 
> yamls for spark dml using the test suite infra.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2464) Create comprehensive spark datasource yamls similar to deltastreamer

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2464:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Create comprehensive spark datasource yamls similar to deltastreamer
> 
>
> Key: HUDI-2464
> URL: https://issues.apache.org/jira/browse/HUDI-2464
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> We have 2 yamls for spark data source as of now, but coverage is not good 
> enough. We need more comprehensive ones similar to the ones we have for 
> deltastreamer. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2151) Make performant out-of-box configs

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2151:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Make performant out-of-box configs
> --
>
> Key: HUDI-2151
> URL: https://issues.apache.org/jira/browse/HUDI-2151
> Project: Apache Hudi
>  Issue Type: Task
>  Components: code-quality, docs, writer-core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We have quite a few configs which deliver better performance or usability, 
> but are guarded by flags. 
>  This is to identify them, change them, test them (functionally and for 
> perf), and make them the default.
>  
> We also need to ensure we capture all the backward compatibility issues that 
> can arise.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2151) Make performant out-of-box configs

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2151:
-
Issue Type: Improvement  (was: Task)

> Make performant out-of-box configs
> --
>
> Key: HUDI-2151
> URL: https://issues.apache.org/jira/browse/HUDI-2151
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, docs, writer-core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We have quite a few configs which deliver better performance or usability, 
> but are guarded by flags. 
>  This is to identify them, change them, test them (functionally and for 
> perf), and make them the default.
>  
> We also need to ensure we capture all the backward compatibility issues that 
> can arise.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2151) Make performant out-of-box configs

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2151:
-
Priority: Blocker  (was: Critical)

> Make performant out-of-box configs
> --
>
> Key: HUDI-2151
> URL: https://issues.apache.org/jira/browse/HUDI-2151
> Project: Apache Hudi
>  Issue Type: Task
>  Components: code-quality, docs, writer-core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We have quite a few configs which deliver better performance or usability, 
> but are guarded by flags. 
>  This is to identify them, change them, test them (functionally and for 
> perf), and make them the default.
>  
> We also need to ensure we capture all the backward compatibility issues that 
> can arise.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1887) Make schema post processor's default as disabled

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1887:
-
Issue Type: Improvement  (was: Task)

> Make schema post processor's default as disabled
> 
>
> Key: HUDI-1887
> URL: https://issues.apache.org/jira/browse/HUDI-1887
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, sev:high, triaged
> Fix For: 0.12.0
>
>
> With the default value [fix|https://github.com/apache/hudi/pull/2765], the 
> schema post processor no longer needs to be mandatory. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1887) Make schema post processor's default as disabled

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1887:
-
Component/s: spark

> Make schema post processor's default as disabled
> 
>
> Key: HUDI-1887
> URL: https://issues.apache.org/jira/browse/HUDI-1887
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, sev:high, triaged
> Fix For: 0.12.0
>
>
> With the default value [fix|https://github.com/apache/hudi/pull/2765], the 
> schema post processor no longer needs to be mandatory. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1887) Make schema post processor's default as disabled

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1887:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Make schema post processor's default as disabled
> 
>
> Key: HUDI-1887
> URL: https://issues.apache.org/jira/browse/HUDI-1887
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, sev:high, triaged
> Fix For: 0.12.0
>
>
> With the default value [fix|https://github.com/apache/hudi/pull/2765], the 
> schema post processor no longer needs to be mandatory. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3741) Fix flink bucket index bulk insert generates too many small files

2022-03-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3741:
-
Labels: pull-request-available  (was: )

> Fix flink bucket index bulk insert generates too many small files
> -
>
> Key: HUDI-3741
> URL: https://issues.apache.org/jira/browse/HUDI-3741
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] danny0405 opened a new pull request #5164: [HUDI-3741] Fix flink bucket index bulk insert generates too many sma…

2022-03-28 Thread GitBox


danny0405 opened a new pull request #5164:
URL: https://github.com/apache/hudi/pull/5164


   …ll files
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-3741) Fix flink bucket index bulk insert generates too many small files

2022-03-28 Thread Danny Chen (Jira)
Danny Chen created HUDI-3741:


 Summary: Fix flink bucket index bulk insert generates too many 
small files
 Key: HUDI-3741
 URL: https://issues.apache.org/jira/browse/HUDI-3741
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink-sql
Reporter: Danny Chen
 Fix For: 0.11.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1549) Programmatic way to fetch earliest commit retained

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1549:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Programmatic way to fetch earliest commit retained 
> ---
>
> Key: HUDI-1549
> URL: https://issues.apache.org/jira/browse/HUDI-1549
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cleaning
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: query-eng, sev:normal, user-support-issues
> Fix For: 0.12.0
>
>
> For GDPR deletions, it would be nice if customers could programmatically know 
> what the earliest commit retained is. 
> More context: https://github.com/apache/hudi/issues/2135
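
A minimal sketch of what such a helper could look like with today's public meta client APIs is below. It approximates "earliest commit retained" as the first completed instant still present on the active commits timeline; the class and method names around it are illustrative, and the cleaner's own bookkeeping may differ.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

public class EarliestCommitRetainedSketch {
  public static Option<String> earliestCommitRetained(String basePath) {
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath(basePath)
        .build();
    // First completed instant still on the commits timeline, i.e. the oldest commit
    // whose data the cleaner has not yet removed.
    Option<HoodieInstant> first = metaClient.getActiveTimeline()
        .getCommitsTimeline()
        .filterCompletedInstants()
        .firstInstant();
    return first.map(HoodieInstant::getTimestamp);
  }
}
{code}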



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1549) Programmatic way to fetch earliest commit retained

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1549:
-
Issue Type: New Feature  (was: Improvement)

> Programmatic way to fetch earliest commit retained 
> ---
>
> Key: HUDI-1549
> URL: https://issues.apache.org/jira/browse/HUDI-1549
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cleaning
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: query-eng, sev:normal, user-support-issues
> Fix For: 0.11.0
>
>
> For GDPR deletions, it would be nice if customers could programmatically know 
> what the earliest commit retained is. 
> More context: https://github.com/apache/hudi/issues/2135



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1549) Programmatic way to fetch earliest commit retained

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1549:
-
Component/s: timeline-server

> Programmatic way to fetch earliest commit retained 
> ---
>
> Key: HUDI-1549
> URL: https://issues.apache.org/jira/browse/HUDI-1549
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cleaning, timeline-server
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: query-eng, sev:normal, user-support-issues
> Fix For: 0.12.0
>
>
> For GDPR deletions, it would be nice if customers could programmatically know 
> what the earliest commit retained is. 
> More context: https://github.com/apache/hudi/issues/2135



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1038:
-
Description: Add benchmark code to the repo to be reused.

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Critical
> Fix For: 0.12.0
>
>
> Add benchmark code to the repo to be reused.
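
A minimal JMH skeleton of the kind that could live in such a module is sketched below; the class name and the placeholder workload are illustrative, and a real benchmark would call into an actual Hudi code path.

{code:java}
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ExampleHudiBenchmark {

  @Benchmark
  public long placeholderWorkload() {
    // Placeholder workload; a real benchmark would exercise a Hudi code path such as
    // key generation, log block serialization, or timeline loading.
    long acc = 0;
    for (int i = 0; i < 1_000; i++) {
      acc += Integer.toHexString(i).hashCode();
    }
    return acc;
  }

  public static void main(String[] args) throws Exception {
    new Runner(new OptionsBuilder()
        .include(ExampleHudiBenchmark.class.getSimpleName())
        .build()).run();
  }
}
{code}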



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1038:
-
Priority: Major  (was: Critical)

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.12.0
>
>
> Add benchmark code to the repo to be reused.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1038:
-
Priority: Critical  (was: Major)

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Critical
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-945) Cleanup spillable map files eagerly as part of close

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-945:

Priority: Blocker  (was: Major)

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:high
> Fix For: 0.12.0
>
>
> Currently, files used by the external spillable map are deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases which run several 
> iterations, it is better to eagerly delete the files when closing the handles 
> that use them. 
> We need to eagerly delete the files in the following cases:
>  # HoodieMergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView
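
The pattern being asked for is essentially delete-on-close instead of delete-on-exit. A generic sketch follows; the class below is illustrative and is not one of the Hudi handles listed above.

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Generic illustration of the requested pattern: the handle owns its spill file and
// deletes it eagerly in close(), instead of relying on deleteOnExit(), so long-running
// continuous-mode jobs do not accumulate spill files across iterations.
public class SpillingHandle implements AutoCloseable {
  private final File spillFile;

  public SpillingHandle(File spillDir) throws IOException {
    this.spillFile = File.createTempFile("spillable_map_", ".data", spillDir);
  }

  public File getSpillFile() {
    return spillFile;
  }

  @Override
  public void close() {
    try {
      // Eager cleanup: remove the backing file as soon as the handle is closed.
      Files.deleteIfExists(spillFile.toPath());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
{code}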



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1038:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-945) Cleanup spillable map files eagerly as part of close

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-945:

Fix Version/s: 0.12.0
   (was: 0.11.0)

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Balaji Varadarajan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, sev:high
> Fix For: 0.12.0
>
>
> Currently, files used by the external spillable map are deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases which run several 
> iterations, it is better to eagerly delete the files when closing the handles 
> that use them. 
> We need to eagerly delete the files in the following cases:
>  # HoodieMergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4962: [HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

2022-03-28 Thread GitBox


hudi-bot commented on pull request #4962:
URL: https://github.com/apache/hudi/pull/4962#issuecomment-1081401588


   
   ## CI report:
   
   * bb65f08889055d1ed1908b858a398a98e9bfac64 UNKNOWN
   * bd83cf3b8dcf7ae81e54c1d0c9b19e75aa087eec UNKNOWN
   * 2a8b30e4c3361e7ccfc528be2c455008f56578eb UNKNOWN
   * 374d82baa881f52a426c32968c2a26efcbe19c48 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7485)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4962: [HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

2022-03-28 Thread GitBox


hudi-bot removed a comment on pull request #4962:
URL: https://github.com/apache/hudi/pull/4962#issuecomment-1081319613


   
   ## CI report:
   
   * bb65f08889055d1ed1908b858a398a98e9bfac64 UNKNOWN
   * bd83cf3b8dcf7ae81e54c1d0c9b19e75aa087eec UNKNOWN
   * 2a8b30e4c3361e7ccfc528be2c455008f56578eb UNKNOWN
   * 02d695df4cb6bc84dbead81683167c8fe7034a23 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7456)
 
   * 374d82baa881f52a426c32968c2a26efcbe19c48 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7485)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3738) Perf comparison between parquet and hudi for COW snapshot and MOR read optimized

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3738:
-
Sprint: Hudi-Sprint-Mar-22

> Perf comparison between parquet and hudi for COW snapshot and MOR read 
> optimized
> 
>
> Key: HUDI-3738
> URL: https://issues.apache.org/jira/browse/HUDI-3738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3650) Revisit all usages of filterPendingCompactionTimeline()

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3650:
-
Sprint: Hudi-Sprint-Mar-22

> Revisit all usages of filterPendingCompactionTimeline() 
> 
>
> Key: HUDI-3650
> URL: https://issues.apache.org/jira/browse/HUDI-3650
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Yue Zhang
>Priority: Blocker
> Fix For: 0.11.0
>
>
> [https://github.com/apache/hudi/pull/4172/files]
>  
> We need to find all usages of filterPendingCompactionTimeline, 
> getTimelineOfActions and replace them with new methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3135) Fix Show Partitions Command's Result after drop partition

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3135:
-
Priority: Blocker  (was: Critical)

> Fix Show Partitions Command's Result after drop partition
> -
>
> Key: HUDI-3135
> URL: https://issues.apache.org/jira/browse/HUDI-3135
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> # add two partitions: dt='2021-10-01', dt='2021-10-02'
>  # drop one partition: dt='2021-10-01'
>  # show partitions. The query result is dt='2021-10-01', dt='2021-10-02'; the 
> expected result is dt='2021-10-02'.
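
The steps above can be reproduced with Spark SQL; a minimal sketch is below. The table name `h0`, its schema, and the inserted rows are assumptions made for illustration.

{code:java}
import org.apache.spark.sql.SparkSession;

public class ShowPartitionsRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-show-partitions-repro")
        .master("local[1]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Hudi SQL extensions are needed for ALTER TABLE ... DROP PARTITION on a Hudi table.
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // Assumes a partitioned Hudi table `h0` (id INT, name STRING, dt STRING) partitioned by dt.
    spark.sql("INSERT INTO h0 VALUES (1, 'a', '2021-10-01')");
    spark.sql("INSERT INTO h0 VALUES (2, 'b', '2021-10-02')");
    spark.sql("ALTER TABLE h0 DROP PARTITION (dt='2021-10-01')");
    // Before the fix this still lists dt='2021-10-01'; the expected output is only dt='2021-10-02'.
    spark.sql("SHOW PARTITIONS h0").show(false);
  }
}
{code}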



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3135) Fix Show Partitions Command's Result after drop partition

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3135:
-
Sprint: Cont' improve -  2021/01/10, Cont' improve -  2021/01/18, Cont' 
improve -  2021/01/24, Cont' improve -  2021/01/31, Cont' improve -  
2022/02/07, Cont' improve -  2022/02/14, Cont' improve -  2022/02/21, Cont' 
improve - 2022/03/01, Hudi-Sprint-Mar-22  (was: Cont' improve -  2021/01/10, 
Cont' improve -  2021/01/18, Cont' improve -  2021/01/24, Cont' improve -  
2021/01/31, Cont' improve -  2022/02/07, Cont' improve -  2022/02/14, Cont' 
improve -  2022/02/21, Cont' improve - 2022/03/01, Cont' improve - 2022/03/7)

> Fix Show Partitions Command's Result after drop partition
> -
>
> Key: HUDI-3135
> URL: https://issues.apache.org/jira/browse/HUDI-3135
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> # add two partitions: dt='2021-10-01', dt='2021-10-02'
>  # drop one partition: dt='2021-10-01'
>  # show partitions. The query result is dt='2021-10-01', dt='2021-10-02'; the 
> expected result is dt='2021-10-02'.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3650) Revisit all usages of filterPendingCompactionTimeline()

2022-03-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-3650:
---

Assignee: Yue Zhang

> Revisit all usages of filterPendingCompactionTimeline() 
> 
>
> Key: HUDI-3650
> URL: https://issues.apache.org/jira/browse/HUDI-3650
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Yue Zhang
>Priority: Blocker
> Fix For: 0.11.0
>
>
> [https://github.com/apache/hudi/pull/4172/files]
>  
> We need to find all usages of filterPendingCompactionTimeline, 
> getTimelineOfActions and replace them with new methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3650) Revisit all usages of filterPendingCompactionTimeline()

2022-03-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3650:

Priority: Blocker  (was: Critical)

> Revisit all usages of filterPendingCompactionTimeline() 
> 
>
> Key: HUDI-3650
> URL: https://issues.apache.org/jira/browse/HUDI-3650
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> [https://github.com/apache/hudi/pull/4172/files]
>  
> We need to find all usages of filterPendingCompactionTimeline, 
> getTimelineOfActions and replace them with new methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


  1   2   3   4   5   6   7   8   >