[GitHub] [hudi] weimingdiit commented on pull request #7281: [HUDI-5270] Duplicate key error when insert_overwrite same partition …

2023-02-01 Thread via GitHub


weimingdiit commented on PR #7281:
URL: https://github.com/apache/hudi/pull/7281#issuecomment-1413269803

   @danny0405 Hi Danny, could you please help review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5685:
-

 Summary: Fix performance gap in Bulk Insert row-writing path with 
enabled de-duplication
 Key: HUDI-5685
 URL: https://issues.apache.org/jira/browse/HUDI-5685
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


Currently, when the flag {{hoodie.combine.before.insert}} is set to true and 
{{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert row-writing 
performance degrades considerably due to the following circumstances:
 * During de-duplication (within {{dedupRows}}) records in the incoming RDD 
are reshuffled (by Spark's default {{HashPartitioner}}) based on 
{{(partition-path, record-key)}} into N partitions
 * When {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
re-partitioning is performed, and therefore each Spark task might end up 
writing into M table partitions
 * This in turn causes an explosion in the number of created (small) files, 
hurting both write performance and the table's layout
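
A hedged sketch of the write-side configuration involved (the option keys are Hudi's public write configs; the input path, table name, and `spark` session are illustrative). Until the gap is closed, picking a sort mode that re-clusters records by partition path, e.g. {{GLOBAL_SORT}}, sidesteps the small-file explosion described above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Bulk insert with de-duplication enabled. With sort mode NONE, the dedup
// shuffle scatters each table partition's records across all Spark tasks;
// GLOBAL_SORT re-clusters them so each task writes few table partitions.
Dataset<Row> df = spark.read().parquet("/tmp/source");    // illustrative input
df.write().format("hudi")
    .option("hoodie.table.name", "demo_table")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.combine.before.insert", "true")       // triggers dedupRows
    .option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT") // rather than NONE
    .mode(SaveMode.Append)
    .save("/tmp/hudi/demo_table");
```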



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5685:
--
Sprint: Sprint 2023-01-31

> Fix performance gap in Bulk Insert row-writing path with enabled 
> de-duplication
> ---
>
> Key: HUDI-5685
> URL: https://issues.apache.org/jira/browse/HUDI-5685
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, when the flag {{hoodie.combine.before.insert}} is set to true and 
> {{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert row-writing 
> performance degrades considerably due to the following circumstances:
>  * During de-duplication (within {{dedupRows}}) records in the incoming RDD 
> are reshuffled (by Spark's default {{HashPartitioner}}) based on 
> {{(partition-path, record-key)}} into N partitions
>  * When {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
> re-partitioning is performed, and therefore each Spark task might end up 
> writing into M table partitions
>  * This in turn causes an explosion in the number of created (small) files, 
> hurting both write performance and the table's layout



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] pushpavanthar commented on issue #7757: [SUPPORT] missing records when HoodieDeltaStreamer run in continuous mode

2023-02-01 Thread via GitHub


pushpavanthar commented on issue #7757:
URL: https://github.com/apache/hudi/issues/7757#issuecomment-1413264207

   @codope I have identified the root cause and a fix for this issue. Can you 
help me create a Jira and assign it to me? I'll update it with the approach to 
fix this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Status: Patch Available  (was: In Progress)

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target 
> table has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden by 
> user-provided configuration. We need to address that.
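
A hedged sketch of the intended precedence (the helper and names are illustrative, not Hudi's actual CTAS code): the CTAS-derived value should act only as a default that user-provided options can override.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing the desired option precedence:
Map<String, String> mergeWriteOptions(boolean tableHasPrecombineKey,
                                      Map<String, String> userOptions) {
  Map<String, String> merged = new HashMap<>();
  // CTAS-derived default: combine on insert only when a pre-combine key exists.
  merged.put("hoodie.combine.before.insert", String.valueOf(tableHasPrecombineKey));
  // User-provided configuration must win over the derived default.
  merged.putAll(userOptions);
  return merged;
}
```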



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Status: In Progress  (was: Open)

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target 
> table has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden by 
> user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5681.
-
Resolution: Fixed

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> While running our benchmark suite against 0.13 RC, we've stumbled upon 
> following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   

[GitHub] [hudi] weimingdiit commented on pull request #7362: [HUDI-5315] The record size is dynamically estimated when the table i…

2023-02-01 Thread via GitHub


weimingdiit commented on PR #7362:
URL: https://github.com/apache/hudi/pull/7362#issuecomment-1413253602

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5647) Automate savepoint and restore tests

2023-02-01 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-5647:
-
Fix Version/s: 0.13.1

> Automate savepoint and restore tests
> 
>
> Key: HUDI-5647
> URL: https://issues.apache.org/jira/browse/HUDI-5647
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Automate savepoint and restore tests
> Scenarios to cover:
>  
> All tests to be done for
> w/ and w/o metadata
> partitioned and non-partitioned dataset. 
> COW
> Format:
> scenario being tested
> timeline 
> what to expect after restore. 
> 1. straightforward
> C1, C2, savepoint C2. C3, C4, restore. 
> should go back to C2. 
> C3, C4 should be cleaned up. 
> 2. pending inflight. 
> C1, C2, savepoint C2. C3, C4 inflight. restore. 
> should go back to C2. 
> C3, C4 should be cleaned up. 
> 3. completed rollbacks in timeline. 
> C1, C2, savepoint C2, C3, C4 (RB_C3), C5. restore. 
> should go back to C2. 
> C3, C4(RB_C3), C5 should be cleaned up. 
> 4. pending rollbacks after savepoint. 
> C1, C2, savepoint C2, C3, C4 (RB_C3) inflight. restore. 
> should go back to C2. 
> C3, C4 (RB_C3) should be cleaned up. 
> 5. clean commits after savepoint. 
> C1, C2, savepoint C2, C3, C4, C5 (clean C1), C6, restore
> should go back to C2. 
> C3, C4, C5 (clean C1), C6 should be cleaned up.
> 6. clustering. 
> C1, C2, savepoint C2. C3, C4.replace commit, C5, restore. 
> should go back to C2. 
> C3, C4.replace commit, C5 should be cleaned up. 
> 7. pending clustering after savepoint. 
> C1, C2, savepoint C2. C3, C4.replace commit.inflight, C5, restore. 
> should go back to C2.
> C3, C4.replace commit files and C5 files should be cleaned up. 
> 8. completed clustering before savepoint. 
> C1, C2, C3.replacecommit.complete, C4, savepoint C4, C5, restore. 
> should go back to C4.
> C5 should be cleaned up. 
> 9. pending clustering before savepoint. 
> C1, C2, C3.replace commit.inflight, C3, C4, savepoint C4, C5, restore 
> should go back to C4. 
> C4 should be cleaned up. if pipeline is restarted, C3.replace commit should 
> be re-attempted. 
> MOR 
> 1. simple one
> DC1, DC2, DC3, savepoint DC3. DC4, DC5. restore
> should rollback DC4 and DC5 
> No files will be cleaned up. only rollback log appends. 
> 2. simple one w/ compaction. 
> DC1, DC2, DC3, C4, savepoint C4. DC5, DC6. restore
> should rollback DC5 and DC6 
> No files will be cleaned up. only rollback log appends. 
> 3. another one w/ compaction. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5, DC6, DC7. restore
> should rollback DC5 and DC6. 
> latest file slice should be fully cleaned up. and rollback log appends for 
> DC4 in first file slice. 
> 4. compaction and clean commits. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5, DC6, DC7, DC8, C9, C10.clean, DC11, 
> DC12 restore. 
> should take the table back to DC3. 
> Cleaner should not have cleaned up file slice 1 since it was part of 
> savepoint. Entire file slice 2 and 3 should be cleaned up. 
> i.e. C5, DC6, DC7, DC8, C9, C10.clean, DC11, DC12. and a rollback log append 
> for DC4. 
> 5. pending compaction after savepoint. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5.pending. DC6, DC7. restore
> should rollback until DC3. 
> latest file slice should be fully deleted. for DC4 a rollback log append 
> should be made. 
> 6. pending compaction before savepoint. 
> DC1, DC2, DC3, C4.pending, DC5, savepoint DC5, DC6, DC7. restore
> should rollback until DC5. 
> rollback log appends for DC6 and DC7. 
> 7. compaction and clustering. completed clustering before savepoint. 
> DC1, DC2, DC3, C4, DC5, C6.replacecommit.completed. DC7, savepoint DC7, DC8, 
> DC9. restore
> inspect what C6 does. likely it will create a new file group. and then start 
> taking in DC7. 
> should take the table back to DC7. 
> rollback log appends for DC8 and DC9. 
> 8. compaction and clustering. completed clustering after savepoint. 
> DC1, DC2, DC3, C4, DC5, savepoint DC5, C6.replacecommit.completed, DC7, DC8, 
> restore
> inspect what C6 does. likely it will create a new file group. and then start 
> taking in DC7. 
> should take the table back to DC5. 
> latest file slice created by C6 should be cleaned up fully. 
> 9. pending clustering before savepoint. 
> DC1, DC2, DC3, C4, DC5, C6.replacecommit.inflight. DC7, savepoint DC7, DC8, 
> DC9. restore
> should take the table back to DC7. 
> rollback log appends for DC8 and DC9. when pipeline is restarted, C6 should 
> be re-attempted and get to completion. 
> 10. pending clustering after savepoint. 
> DC1, DC2, DC3, C4, DC5, savepoint DC5, C6.replacecommit.inflight, DC7, DC8, 
> restore
> should take the table back to DC5. 
> 
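
As a rough companion to the scenarios above, a hedged sketch of how savepoint and restore are driven through the Spark write client (the client setup and instant time are illustrative):

```java
import org.apache.hudi.client.SparkRDDWriteClient;

// Assumes an initialized client, e.g.:
// SparkRDDWriteClient client = new SparkRDDWriteClient<>(context, writeConfig);
String savepointedInstant = "20230201000000000"; // e.g. instant time of C2
client.savepoint(savepointedInstant, "test-user", "savepoint before C3/C4");

// ... commits C3, C4 happen here ...

// Restore brings the timeline back to the savepointed instant; commits
// after it (C3, C4) are expected to be cleaned up, per the scenarios above.
client.restoreToSavepoint(savepointedInstant);
```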

[jira] [Closed] (HUDI-5647) Automate savepoint and restore tests

2023-02-01 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5647.

Resolution: Fixed

Fixed via master branch: 6fbf9d4f840b1079877bd6d2e649678a9e01b715

> Automate savepoint and restore tests
> 
>
> Key: HUDI-5647
> URL: https://issues.apache.org/jira/browse/HUDI-5647
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Automate savepoint and restore tests
> Scenarios to cover:
>  
> All tests to be done for
> w/ and w/o metadata
> partitioned and non-partitioned dataset. 
> COW
> Format:
> scenario being tested
> timeline 
> what to expect after restore. 
> 1. straightforward
> C1, C2, savepoint C2. C3, C4, restore. 
> should go back to C2. 
> C3, C4 should be cleaned up. 
> 2. pending inflight. 
> C1, C2, savepoint C2. C3, C4 inflight. restore. 
> should go back to C2. 
> C3, C4 should be cleaned up. 
> 3. completed rollbacks in timeline. 
> C1, C2, savepoint C2, C3, C4 (RB_C3), C5. restore. 
> should go back to C2. 
> C3, C4(RB_C3), C5 should be cleaned up. 
> 4. pending rollbacks after savepoint. 
> C1, C2, savepoint C2, C3, C4 (RB_C3) inflight. restore. 
> should go back to C2. 
> C3, C4 (RB_C3) should be cleaned up. 
> 5. clean commits after savepoint. 
> C1, C2, savepoint C2, C3, C4, C5 (clean C1), C6, restore
> should go back to C2. 
> C3, C4, C5 (clean C1), C6 should be cleaned up.
> 6. clustering. 
> C1, C2, savepoint C2. C3, C4.replace commit, C5, restore. 
> should go back to C2. 
> C3, C4.replace commit, C5 should be cleaned up. 
> 7. pending clustering after savepoint. 
> C1, C2, savepoint C2. C3, C4.replace commit.inflight, C5, restore. 
> should go back to C2.
> C3, C4.replace commit files and C5 files should be cleaned up. 
> 8. completed clustering before savepoint. 
> C1, C2, C3.replacecommit.complete, C4, savepoint C4, C5, restore. 
> should go back to C4.
> C5 should be cleaned up. 
> 9. pending clustering before savepoint. 
> C1, C2, C3.replace commit.inflight, C3, C4, savepoint C4, C5, restore 
> should go back to C4. 
> C4 should be cleaned up. if pipeline is restarted, C3.replace commit should 
> be re-attempted. 
> MOR 
> 1. simple one
> DC1, DC2, DC3, savepoint DC3. DC4, DC5. restore
> should rollback DC4 and DC5 
> No files will be cleaned up. only rollback log appends. 
> 2. simple one w/ compaction. 
> DC1, DC2, DC3, C4, savepoint C4. DC5, DC6. restore
> should rollback DC5 and DC6 
> No files will be cleaned up. only rollback log appends. 
> 3. another one w/ compaction. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5, DC6, DC7. restore
> should rollback DC5 and DC6. 
> latest file slice should be fully cleaned up. and rollback log appends for 
> DC4 in first file slice. 
> 4. compaction and clean commits. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5, DC6, DC7, DC8, C9, C10.clean, DC11, 
> DC12 restore. 
> should take the table back to DC3. 
> Cleaner should not have cleaned up file slice 1 since it was part of 
> savepoint. Entire file slice 2 and 3 should be cleaned up. 
> i.e. C5, DC6, DC7, DC8, C9, C10.clean, DC11, DC12. and a rollback log append 
> for DC4. 
> 5. pending compaction after savepoint. 
> DC1, DC2, DC3, savepoint DC3, DC4, C5.pending. DC6, DC7. restore
> should rollback until DC3. 
> latest file slice should be fully deleted. for DC4 a rollback log append 
> should be made. 
> 6. pending compaction before savepoint. 
> DC1, DC2, DC3, C4.pending, DC5, savepoint DC5, DC6, DC7. restore
> should rollback until DC5. 
> rollback log appends for DC6 and DC7. 
> 7. compaction and clustering. completed clustering before savepoint. 
> DC1, DC2, DC3, C4, DC5, C6.replacecommit.completed. DC7, savepoint DC7, DC8, 
> DC9. restore
> inspect what C6 does. likely it will create a new file group. and then start 
> taking in DC7. 
> should take the table back to DC7. 
> rollback log appends for DC8 and DC9. 
> 8. compaction and clustering. completed clustering after savepoint. 
> DC1, DC2, DC3, C4, DC5, savepoint DC5, C6.replacecommit.completed, DC7, DC8, 
> restore
> inspect what C6 does. likely it will create a new file group. and then start 
> taking in DC7. 
> should take the table back to DC5. 
> latest file slice created by C6 should be cleaned up fully. 
> 9. pending clustering before savepoint. 
> DC1, DC2, DC3, C4, DC5, C6.replacecommit.inflight. DC7, savepoint DC7, DC8, 
> DC9. restore
> should take the table back to DC7. 
> rollback log appends for DC8 and DC9. when pipeline is restarted, C6 should 
> be re-attempted and get to completion. 
> 10. pending clustering after savepoint. 
> DC1, DC2, DC3, C4, DC5, savepoint DC5, C6.replacecommit.inflight, 

[hudi] branch master updated (abe26d4169c -> 6fbf9d4f840)

2023-02-01 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from abe26d4169c [HUDI-5676] Fix BigQuerySyncTool standalone mode (#7816)
 add 6fbf9d4f840 [HUDI-5647] Automate savepoint and restore tests (#7796)

No new revisions were added by this update.

Summary of changes:
 .../TestSavepointRestoreCopyOnWrite.java   | 173 ++
 .../TestSavepointRestoreMergeOnRead.java   | 248 +
 .../hudi/testutils/HoodieClientTestBase.java   |  62 +-
 3 files changed, 482 insertions(+), 1 deletion(-)
 create mode 100644 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestSavepointRestoreCopyOnWrite.java
 create mode 100644 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestSavepointRestoreMergeOnRead.java



[GitHub] [hudi] danny0405 merged pull request #7796: [HUDI-5647] Automate savepoint and restore tests

2023-02-01 Thread via GitHub


danny0405 merged PR #7796:
URL: https://github.com/apache/hudi/pull/7796


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7825: [DNM] Fixing deduplication in Bulk Insert row-writing path

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7825:
URL: https://github.com/apache/hudi/pull/7825#issuecomment-1413230305

   
   ## CI report:
   
   * db6d970776ab0dcf2f5b29c54137bd0796202077 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14860)
 
   * 1cb6c17f6b37b86f122dd22d8791013e553a8493 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14864)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [HUDI-5684] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413230238

   
   ## CI report:
   
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * a3b0274bd9dc76d29d1a169127d3e13ea89fd23c Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14859)
 
   * 3ff4e90e07f0287715914d33f323a7ef6a58b440 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14862)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7633: Fix Deletes issued without any prior commits

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7633:
URL: https://github.com/apache/hudi/pull/7633#issuecomment-1413229832

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * a3e25d91fe89abb52b2019c5f5a68f28a321a1f8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14863)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-5676] Fix BigQuerySyncTool standalone mode (#7816)

2023-02-01 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new abe26d4169c [HUDI-5676] Fix BigQuerySyncTool standalone mode (#7816)
abe26d4169c is described below

commit abe26d4169c04da05b99941161621876e3569e96
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Feb 2 00:39:28 2023 -0600

[HUDI-5676] Fix BigQuerySyncTool standalone mode (#7816)
---
 .../hudi/gcp/bigquery/BigQuerySyncConfig.java  | 38 
 .../gcp/bigquery/TestBigQuerySyncToolArgs.java | 70 ++
 packaging/hudi-gcp-bundle/pom.xml  |  8 ++-
 3 files changed, 90 insertions(+), 26 deletions(-)

diff --git 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java
index b46cd9a9f81..52b3d3b74e5 100644
--- 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java
+++ 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java
@@ -20,14 +20,13 @@
 package org.apache.hudi.gcp.bigquery;
 
 import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.sync.common.HoodieSyncConfig;
 
 import com.beust.jcommander.Parameter;
 import com.beust.jcommander.ParametersDelegate;
 
 import java.io.Serializable;
-import java.util.ArrayList;
-import java.util.List;
 import java.util.Properties;
 
 /**
@@ -101,38 +100,27 @@ public class BigQuerySyncConfig extends HoodieSyncConfig 
implements Serializable
 public String datasetName;
 @Parameter(names = {"--dataset-location"}, description = "Location of the 
target dataset in BigQuery", required = true)
 public String datasetLocation;
-@Parameter(names = {"--table-name"}, description = "Name of the target 
table in BigQuery", required = true)
-public String tableName;
 @Parameter(names = {"--source-uri"}, description = "Name of the source uri 
gcs path of the table", required = true)
 public String sourceUri;
 @Parameter(names = {"--source-uri-prefix"}, description = "Name of the 
source uri gcs path prefix of the table", required = true)
 public String sourceUriPrefix;
-@Parameter(names = {"--base-path"}, description = "Base path of the hoodie 
table to sync", required = true)
-public String basePath;
-@Parameter(names = {"--partitioned-by"}, description = "Comma-delimited 
partition fields. Default to non-partitioned.")
-public List<String> partitionFields = new ArrayList<>();
-@Parameter(names = {"--use-file-listing-from-metadata"}, description = 
"Fetch file listing from Hudi's metadata")
-public boolean useFileListingFromMetadata = false;
-@Parameter(names = {"--assume-date-partitioning"}, description = "Assume 
standard /mm/dd partitioning, this"
-+ " exists to support backward compatibility. If you use hoodie 0.3.x, 
do not set this parameter")
-public boolean assumeDatePartitioning = false;
 
 public boolean isHelp() {
   return hoodieSyncConfigParams.isHelp();
 }
 
-public Properties toProps() {
-  final Properties props = hoodieSyncConfigParams.toProps();
-  props.setProperty(BIGQUERY_SYNC_PROJECT_ID.key(), projectId);
-  props.setProperty(BIGQUERY_SYNC_DATASET_NAME.key(), datasetName);
-  props.setProperty(BIGQUERY_SYNC_DATASET_LOCATION.key(), datasetLocation);
-  props.setProperty(BIGQUERY_SYNC_TABLE_NAME.key(), tableName);
-  props.setProperty(BIGQUERY_SYNC_SOURCE_URI.key(), sourceUri);
-  props.setProperty(BIGQUERY_SYNC_SOURCE_URI_PREFIX.key(), 
sourceUriPrefix);
-  props.setProperty(BIGQUERY_SYNC_SYNC_BASE_PATH.key(), basePath);
-  props.setProperty(BIGQUERY_SYNC_PARTITION_FIELDS.key(), String.join(",", 
partitionFields));
-  props.setProperty(BIGQUERY_SYNC_USE_FILE_LISTING_FROM_METADATA.key(), 
String.valueOf(useFileListingFromMetadata));
-  props.setProperty(BIGQUERY_SYNC_ASSUME_DATE_PARTITIONING.key(), 
String.valueOf(assumeDatePartitioning));
+public TypedProperties toProps() {
+  final TypedProperties props = hoodieSyncConfigParams.toProps();
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_PROJECT_ID.key(), projectId);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_DATASET_NAME.key(), 
datasetName);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_DATASET_LOCATION.key(), 
datasetLocation);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_TABLE_NAME.key(), 
hoodieSyncConfigParams.tableName);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_SOURCE_URI.key(), sourceUri);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_SOURCE_URI_PREFIX.key(), 
sourceUriPrefix);
+  props.setPropertyIfNonNull(BIGQUERY_SYNC_SYNC_BASE_PATH.key(), 
hoodieSyncConfigParams.basePath);
+  

[GitHub] [hudi] xushiyan merged pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


xushiyan merged PR #7816:
URL: https://github.com/apache/hudi/pull/7816


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


xushiyan commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1413226235

   The CI failure is unrelated to this change; the module tests passed.
   
   ```
   [INFO] hudi-gcp ... SUCCESS [  2.837 s]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5684:
-
Labels: pull-request-available  (was: )

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target 
> table has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden by 
> user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7825: [DNM] Fixing deduplication in Bulk Insert row-writing path

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7825:
URL: https://github.com/apache/hudi/pull/7825#issuecomment-1413225019

   
   ## CI report:
   
   * db6d970776ab0dcf2f5b29c54137bd0796202077 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14860)
 
   * 1cb6c17f6b37b86f122dd22d8791013e553a8493 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [HUDI-5684] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413224937

   
   ## CI report:
   
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * cab28490849cfc1288af59b77bd8986b58346782 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14857)
 
   * a3b0274bd9dc76d29d1a169127d3e13ea89fd23c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14859)
 
   * 3ff4e90e07f0287715914d33f323a7ef6a58b440 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7633: Fix Deletes issued without any prior commits

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7633:
URL: https://github.com/apache/hudi/pull/7633#issuecomment-1413224514

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * a3e25d91fe89abb52b2019c5f5a68f28a321a1f8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

2023-02-01 Thread via GitHub


TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1413224065

   > @TengHuo @danny0405 What's the followup here? If it's a bug, is it going 
to be fixed in 0.13.0?
   
   Hi @codope, I think it is an inconsistent-behaviour issue between Spark and 
Flink. We may need to fix it on both the Spark and Flink sides at the same time. 
I added some details in this PR: 
https://github.com/apache/hudi/pull/7307#issuecomment-1413220811
   
   Do you have any suggestions about how to fix it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TengHuo commented on pull request #7307: [HUDI-5271] fix issue inconsistent reader and writer schema in HoodieAvroDataBlock

2023-02-01 Thread via GitHub


TengHuo commented on PR #7307:
URL: https://github.com/apache/hudi/pull/7307#issuecomment-1413220811

   Agreed @danny0405, I think it's better to unify Avro schema handling across 
Spark and Flink in Hudi.
   
   Currently, we have the Avro schema utility class 
`org.apache.hudi.avro.AvroSchemaUtils` in the `hudi-common` module to manipulate 
Avro schemas. Hudi Spark uses 
`org.apache.spark.sql.avro.SchemaConverters` to convert between Spark 
DataType and Avro schema, while Hudi Flink uses 
`org.apache.hudi.util.AvroSchemaConverter` to convert between Flink 
DataType and Avro schema.
   
   I noticed that there is a difference in behaviour when setting the name of a 
new Avro schema.
   
   **On the Spark side**, the name and namespace of the Avro schema are exposed 
as method parameters.
   
   ```scala
  /**
   * Converts a Spark SQL schema to a corresponding Avro schema.
   *
   * @since 2.4.0
   */
  def toAvroType(catalystType: DataType,
                 nullable: Boolean = false,
                 recordName: String = "topLevelRecord",
                 nameSpace: String = ""): Schema
   ```
   
   reference: 
https://github.com/apache/hudi/blob/41653fc708854828bacb23ed624ca6b3a67d6737/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L154
   
   **On the Flink side**, a constant name `"record"` is used:
   
   ```java
  /**
   * Converts Flink SQL {@link LogicalType} (can be nested) into an Avro schema.
   *
   * <p>Use "record" as the type name.
   *
   * @param schema the schema type, usually it should be the top level record
   *               type, e.g. not a nested type
   * @return Avro's {@link Schema} matching this logical type.
   */
  public static Schema convertToSchema(LogicalType schema) {
    return convertToSchema(schema, "record");
  }
   ```
   
   reference: 
https://github.com/apache/hudi/blob/8ffcb2fc9470077bdcf3810756545d081fb6523c/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/AvroSchemaConverter.java#L202
 
   
   (please correct me if I'm wrong)
   
   May I know if it is possible to unify all non-engine-related schema handling 
in one place, e.g. the name conversion rule?
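
   To make the difference concrete, a small illustrative snippet (plain Avro, 
not Hudi code): two record schemas that are structurally identical but carry 
the Spark-style vs. Flink-style names are not equal, which is what surfaces as 
a schema conflict.

   ```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

Schema sparkStyle = SchemaBuilder.record("topLevelRecord") // Spark's default name
    .fields().requiredLong("id").endRecord();
Schema flinkStyle = SchemaBuilder.record("record")         // Flink's constant name
    .fields().requiredLong("id").endRecord();

// Same fields, different full names -> the schemas do not match:
System.out.println(sparkStyle.equals(flinkStyle)); // false
   ```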


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1413218642

   
   ## CI report:
   
   * eba58cdddcd5a83b5843bd8da41ba43b45435210 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14858)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5676) BigQuerySyncTool param conflicts

2023-02-01 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5676:
-
Status: In Progress  (was: Open)

> BigQuerySyncTool param conflicts
> 
>
> Key: HUDI-5676
> URL: https://issues.apache.org/jira/browse/HUDI-5676
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When running in standalone mode, we ran into:
> Exception in thread "main" com.beust.jcommander.ParameterException: Found the 
> option --base-path multiple times
> The bundle is also missing the hive-sync module that contains the partition 
> key extractors.
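
For context, a minimal repro of this class of JCommander failure (class names here are illustrative, not Hudi's actual ones): declaring the same option both on the tool and on a delegated params object triggers the exception quoted above.

```java
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.ParametersDelegate;

public class ParamConflictDemo {
  static class CommonParams {
    @Parameter(names = {"--base-path"}, description = "Base path of the table")
    public String basePath;
  }

  static class ToolParams {
    @ParametersDelegate
    public CommonParams common = new CommonParams();

    // Re-declaring the delegated option is the conflict:
    @Parameter(names = {"--base-path"}, description = "Base path of the table")
    public String basePath;
  }

  public static void main(String[] args) {
    // Fails with com.beust.jcommander.ParameterException:
    //   Found the option --base-path multiple times
    JCommander.newBuilder().addObject(new ToolParams()).build()
        .parse("--base-path", "/tmp/table");
  }
}
```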



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5676) BigQuerySyncTool param conflicts

2023-02-01 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5676:
-
Status: Patch Available  (was: In Progress)

> BigQuerySyncTool param conflicts
> 
>
> Key: HUDI-5676
> URL: https://issues.apache.org/jira/browse/HUDI-5676
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When running in standalone mode, we ran into:
> Exception in thread "main" com.beust.jcommander.ParameterException: Found the 
> option --base-path multiple times
> The bundle is also missing the hive-sync module that contains the partition 
> key extractors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Sprint: Sprint 2023-01-31

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target 
> table has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden by 
> user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5684:
-

 Summary: Fix CTAS to make combine-on-insert configurable
 Key: HUDI-5684
 URL: https://issues.apache.org/jira/browse/HUDI-5684
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target 
table has a pre-combine key specified.

However, it's done in a way that doesn't allow it to be overridden by 
user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (7064c380506 -> e93fbeee4ac)

2023-02-01 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 7064c380506 [MINOR] Restoring existing behavior for `DeltaStreamer` 
Incremental Source (#7810)
 add e93fbeee4ac [HUDI-5681] Fixing Kryo being instantiated w/ invalid 
`SparkConf` (#7821)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/hudi/SerDeUtils.scala | 44 --
 .../hudi/command/MergeIntoHoodieTableCommand.scala |  8 ++--
 .../hudi/command/payload/ExpressionPayload.scala   | 54 --
 3 files changed, 54 insertions(+), 52 deletions(-)
 delete mode 100644 
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala



[GitHub] [hudi] alexeykudinkin merged pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin merged PR #7821:
URL: https://github.com/apache/hudi/pull/7821


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1094049841


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -455,5 +456,50 @@ object ExpressionPayload {
 field.schema, field.doc, field.defaultVal, field.order))
 Schema.createRecord(a.getName, a.getDoc, a.getNamespace, a.isError, 
mergedFields.asJava)
   }
+
+
+  /**
+   * This object differs from Hudi's generic [[SerializationUtils]] in its 
ability to serialize
+   * Spark's internal structures (various [[Expression]]s)
+   *
+   * For that purpose we re-use Spark's [[KryoSerializer]] instance sharing 
configuration
+   * with enclosing [[SparkEnv]]. This is necessary to make sure that this 
particular instance of Kryo
+   * used for serialization of Spark's internal structures (like 
[[Expression]]s) is configured
+   * appropriately (class-loading, custom serializers, etc)
+   *
+   * TODO rebase on Spark's SerializerSupport
+   */
+  private[hudi] object Serializer {
+

Review Comment:
   Checked Spark 3.1, 3.2 and 3.3, working fine
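
   For reference, a hedged sketch of the pattern under review (not the exact 
Hudi code): the serializer must be built from the live `SparkConf` held by 
`SparkEnv`, rather than a fresh `new SparkConf()`, so that registrators and 
class-loading settings are honored.

   ```java
import org.apache.spark.SparkConf;
import org.apache.spark.SparkEnv;
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.serializer.SerializerInstance;

// Reuse the enclosing session's configuration (assumes a running Spark app):
SparkConf conf = SparkEnv.get().conf();
KryoSerializer serializer = new KryoSerializer(conf);
SerializerInstance instance = serializer.newInstance();
// instance.serialize(...)/deserialize(...) now see the same Kryo setup
// (custom serializers, class loaders) as the rest of the Spark job.
   ```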



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] liaotian1005 commented on pull request #7633: Fix Deletes issued without any prior commits

2023-02-01 Thread via GitHub


liaotian1005 commented on PR #7633:
URL: https://github.com/apache/hudi/pull/7633#issuecomment-1413177847

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7825: [DNM] Fixing deduplication in Bulk Insert row-writing path

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7825:
URL: https://github.com/apache/hudi/pull/7825#issuecomment-1413165542

   
   ## CI report:
   
   * db6d970776ab0dcf2f5b29c54137bd0796202077 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14860)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413165478

   
   ## CI report:
   
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * cab28490849cfc1288af59b77bd8986b58346782 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14857)
 
   * a3b0274bd9dc76d29d1a169127d3e13ea89fd23c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14859)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413159520

   
   ## CI report:
   
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14856)
 
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * cab28490849cfc1288af59b77bd8986b58346782 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14857)
 
   * a3b0274bd9dc76d29d1a169127d3e13ea89fd23c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7825: [DNM] Fixing deduplication in Bulk Insert row-writing path

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7825:
URL: https://github.com/apache/hudi/pull/7825#issuecomment-1413159606

   
   ## CI report:
   
   * db6d970776ab0dcf2f5b29c54137bd0796202077 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin opened a new pull request, #7825: [DNM] Fixing deduplication in Bulk Insert row-writing path

2023-02-01 Thread via GitHub


alexeykudinkin opened a new pull request, #7825:
URL: https://github.com/apache/hudi/pull/7825

   ### Change Logs
   
   TBA
   
   ### Impact
   
   TBA
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

2023-02-01 Thread via GitHub


TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1413118913

   > > @TengHuo @danny0405 What's the followup here? If it's a bug, is it going 
to be fixed in 0.13.0?
   > > I fixed it in this pr #7694
   
   @LinMingQiang The PR should fix the compatibility check error only. When 
resolving an Avro schema, the generator compares the full names of the schemas 
if the type is `FIXED`, `ENUM`, `ARRAY`, `MAP`, `RECORD` or `UNION`. If the full 
names do not match, the generator may return a `Symbol.error`, which is then 
thrown as an `AvroTypeException` in `ResolvingDecoder#doAction`.
   
   Code reference: 
`org.apache.avro.io.parsing.ResolvingGrammarGenerator#generate(Schema writer, 
Schema reader, Map<LitS, Symbol> seen)`
   
   ```java
  public Symbol generate(Schema writer, Schema reader,
      Map<LitS, Symbol> seen) throws IOException
  {
    final Schema.Type writerType = writer.getType();
    final Schema.Type readerType = reader.getType();

    if (writerType == readerType) {
      switch (writerType) {
      case NULL:
        return Symbol.NULL;
      case BOOLEAN:
        return Symbol.BOOLEAN;
      case INT:
        return Symbol.INT;
      case LONG:
        return Symbol.LONG;
      case FLOAT:
        return Symbol.FLOAT;
      case DOUBLE:
        return Symbol.DOUBLE;
      case STRING:
        return Symbol.STRING;
      case BYTES:
        return Symbol.BYTES;
      case FIXED:
        if (writer.getFullName().equals(reader.getFullName())
            && writer.getFixedSize() == reader.getFixedSize()) {
          return Symbol.seq(Symbol.intCheckAction(writer.getFixedSize()),
              Symbol.FIXED);
        }
        break;

      case ENUM:
        if (writer.getFullName() == null
            || writer.getFullName().equals(reader.getFullName())) {
          return Symbol.seq(mkEnumAdjust(writer.getEnumSymbols(),
              reader.getEnumSymbols()), Symbol.ENUM);
        }
        break;

      case ARRAY:
        return Symbol.seq(Symbol.repeat(Symbol.ARRAY_END,
            generate(writer.getElementType(),
                reader.getElementType(), seen)),
            Symbol.ARRAY_START);

      case MAP:
        return Symbol.seq(Symbol.repeat(Symbol.MAP_END,
            generate(writer.getValueType(),
                reader.getValueType(), seen), Symbol.STRING),
            Symbol.MAP_START);
      case RECORD:
        return resolveRecords(writer, reader, seen);
      case UNION:
        return resolveUnion(writer, reader, seen);
      default:
        throw new AvroTypeException("Unkown type for schema: " + writerType);
      }
    } else {  // writer and reader are of different types
      if (writerType == Schema.Type.UNION) {
        return resolveUnion(writer, reader, seen);
      }

      switch (readerType) {
      case LONG:
        switch (writerType) {
        case INT:
          return Symbol.resolve(super.generate(writer, seen), Symbol.LONG);
        }
        break;

      case FLOAT:
        switch (writerType) {
        case INT:
        case LONG:
          return Symbol.resolve(super.generate(writer, seen), Symbol.FLOAT);
        }
        break;

      case DOUBLE:
        switch (writerType) {
        case INT:
        case LONG:
        case FLOAT:
          return Symbol.resolve(super.generate(writer, seen), Symbol.DOUBLE);
        }
        break;

      case BYTES:
        switch (writerType) {
        case STRING:
          return Symbol.resolve(super.generate(writer, seen), Symbol.BYTES);
        }
        break;

      case STRING:
        switch (writerType) {
        case BYTES:
          return Symbol.resolve(super.generate(writer, seen), Symbol.STRING);
        }
        break;

      case UNION:
        int j = bestBranch(reader, writer, seen);
        if (j >= 0) {
          Symbol s = generate(writer, reader.getTypes().get(j), seen);
          return Symbol.seq(Symbol.unionAdjustAction(j, s), Symbol.UNION);
        }
        break;
      case NULL:
      case BOOLEAN:
      case INT:
      case ENUM:
      case ARRAY:
      case MAP:
      case RECORD:
      case FIXED:
        break;
      default:
        throw new RuntimeException("Unexpected schema type: " + readerType);
      }
    }
    return Symbol.error("Found " + writer.getFullName()
        + ", expecting " + reader.getFullName());
  }
   ```
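
   A small illustrative repro of how that `Symbol.error` surfaces (plain Avro; 
the record names mirror the Spark/Flink defaults discussed in this thread):

   ```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

Schema writerSchema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"record\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
Schema readerSchema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"topLevelRecord\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

// Write a record with the writer schema...
GenericRecord rec = new GenericData.Record(writerSchema);
rec.put("id", 1L);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, enc);
enc.flush();

// ...then read it back resolving against the differently-named reader schema:
Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericDatumReader<GenericRecord> reader =
    new GenericDatumReader<>(writerSchema, readerSchema);
// Throws org.apache.avro.AvroTypeException: Found record, expecting topLevelRecord
reader.read(null, dec);
   ```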


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 

[GitHub] [hudi] Leoyzen opened a new issue, #7824: [SUPPORT] NPE occurs when enabling metadata on a table which doesn't have metadata previously.

2023-02-01 Thread via GitHub


Leoyzen opened a new issue, #7824:
URL: https://github.com/apache/hudi/issues/7824

   **Describe the problem you faced**
   When enabling the metadata table on a table that didn't have metadata 
previously, an NPE occurs.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Set up a table with metadata disabled.
   2. Restart the job with the metadata option enabled.
   3. An NPE occurs.
   
   **Expected behavior**
   
   No NPE
   
   **Environment Description**
   
   * Hudi version : 0.13.0-rc1
   
   * Spark version : N/A
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.1.3
   
   * Storage (HDFS/S3/GCS..) : OSS
   
   * Running on Docker? (yes/no) : yes, HA VVP
   
   
   **Additional context**
   
   
   **Stacktrace**
   
   ```LOG
   2023-02-01 00:36:05,435 WARN  
org.apache.hudi.metadata.HoodieBackedTableMetadata   [] - Metadata 
table was not found at path 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test/.hoodie/metadata
   2023-02-01 00:36:05,436 INFO  
org.apache.hudi.common.table.view.FileSystemViewManager  [] - Creating View 
Manager with storage type :REMOTE_FIRST
   2023-02-01 00:36:05,436 INFO  
org.apache.hudi.common.table.view.FileSystemViewManager  [] - Creating 
remote first table view
   2023-02-01 00:36:05,438 INFO  
org.apache.hudi.client.transaction.lock.LockManager  [] - LockProvider 
org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider
   2023-02-01 00:36:05,618 INFO  
org.apache.hudi.common.table.HoodieTableMetaClient   [] - Loading 
HoodieTableMetaClient from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test
   2023-02-01 00:36:05,635 INFO  org.apache.hudi.common.table.HoodieTableConfig 
  [] - Loading table properties from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test/.hoodie/hoodie.properties
   2023-02-01 00:36:05,647 INFO  
org.apache.hudi.common.table.HoodieTableMetaClient   [] - Finished 
Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test
   2023-02-01 00:36:05,776 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded 
instants upto : Option{val=[==>20230201003302911__deltacommit__INFLIGHT]}
   2023-02-01 00:36:05,778 WARN  
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter [] - Cannot 
initialize metadata table as operation(s) are in progress on the dataset: 
[[==>20230131214246466__compaction__INFLIGHT], 
[==>20230131214554966__compaction__REQUESTED], 
[==>20230131215522662__compaction__REQUESTED], 
[==>20230131220449842__compaction__REQUESTED], 
[==>20230201000246991__compaction__REQUESTED], 
[==>20230201001718618__rollback__INFLIGHT], 
[==>20230201003302911__deltacommit__INFLIGHT]]
   2023-02-01 00:36:05,778 INFO  
org.apache.hudi.common.table.HoodieTableMetaClient   [] - Loading 
HoodieTableMetaClient from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test
   2023-02-01 00:36:05,789 INFO  org.apache.hudi.common.table.HoodieTableConfig 
  [] - Loading table properties from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test/.hoodie/hoodie.properties
   2023-02-01 00:36:05,798 INFO  
org.apache.hudi.common.table.HoodieTableMetaClient   [] - Finished 
Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test
   2023-02-01 00:36:05,799 INFO  
org.apache.hudi.common.table.HoodieTableMetaClient   [] - Loading 
HoodieTableMetaClient from 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test/.hoodie/metadata
   2023-02-01 00:36:05,814 WARN  
org.apache.hudi.metadata.HoodieBackedTableMetadata   [] - Metadata 
table was not found at path 
oss://dengine-lake-zjk/cloudcode_pre/tmp_hudi_hive_test/.hoodie/metadata
   2023-02-01 00:36:05,938 INFO  org.apache.hadoop.hive.conf.HiveConf   
  [] - Found configuration file 
jar:file:../usrlib/ververica-connector-hudi-1.15-vvr-6.0-hive312-0.13.0-rc1-SNAPSHOT-jar-with-dependencies-20230131234928.jar!/hive-site.xml
   2023-02-01 00:36:06,099 WARN  org.apache.hadoop.hive.conf.HiveConf   
  [] - HiveConf of name 
hive.dummyparam.test.server.specific.config.override does not exist
   2023-02-01 00:36:06,100 WARN  org.apache.hadoop.hive.conf.HiveConf   
  [] - HiveConf of name 
hive.dummyparam.test.server.specific.config.metastoresite does not exist
   
   
   ...
   
   
   
   
   
   2023-02-01 00:36:24,740 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator  [] - Executor 
executes action [handle write metadata event for instant ] success!
   2023-02-01 00:36:24,800 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
compact_commit (1/1) #0 (b732feaa948ea68f9bf1c0df9689f8f4) switched from 
INITIALIZING to RUNNING.
   2023-02-01 00:36:24,886 INFO  ...
   ```

[GitHub] [hudi] hudi-bot commented on pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1413113864

   
   ## CI report:
   
   * 15ecd91180d32c7fa1905c11408f4bc23347e682 UNKNOWN
   * f144027d86cb2fad74a0a4a175e27204dacec8d3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14855)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1093983753


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -328,7 +328,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
   }).toMap
 // Serialize the Map[UpdateCondition, UpdateAssignments] to base64 string
 val serializedUpdateConditionAndExpressions = Base64.getEncoder
-  .encodeToString(SerDeUtils.toBytes(updateConditionToAssignments))
+  .encodeToString(Serializer.toBytes(updateConditionToAssignments))

Review Comment:
   What exactly are you referring to?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1093983668


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -455,5 +456,50 @@ object ExpressionPayload {
 field.schema, field.doc, field.defaultVal, field.order))
 Schema.createRecord(a.getName, a.getDoc, a.getNamespace, a.isError, 
mergedFields.asJava)
   }
+
+
+  /**
+   * This object differs from Hudi's generic [[SerializationUtils]] in its 
ability to serialize
+   * Spark's internal structures (various [[Expression]]s)
+   *
+   * For that purpose we re-use Spark's [[KryoSerializer]] instance sharing 
configuration
+   * with enclosing [[SparkEnv]]. This is necessary to make sure that this 
particular instance of Kryo
+   * used for serialization of Spark's internal structures (like 
[[Expression]]s) is configured
+   * appropriately (class-loading, custom serializers, etc)
+   *
+   * TODO rebase on Spark's SerializerSupport
+   */
+  private[hudi] object Serializer {
+

Review Comment:
   Will check
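
   For reference, a minimal Scala sketch of the pattern the doc comment above 
describes: building Kryo from the live `SparkEnv`'s `SparkConf` instead of a 
fresh, empty conf (object and method names are illustrative, not the PR's 
actual code):

   ```scala
import java.nio.ByteBuffer

import scala.reflect.ClassTag

import org.apache.spark.SparkEnv
import org.apache.spark.serializer.KryoSerializer

object SparkEnvKryoSketch {
  // Built from the enclosing SparkEnv's SparkConf, so registered classes,
  // class loaders and custom serializers match the running job; a Kryo
  // built from `new SparkConf()` would miss them.
  private lazy val serializer = new KryoSerializer(SparkEnv.get.conf)

  def toBytes[T: ClassTag](value: T): Array[Byte] = {
    val buf: ByteBuffer = serializer.newInstance().serialize(value)
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    bytes
  }

  def fromBytes[T: ClassTag](bytes: Array[Byte]): T =
    serializer.newInstance().deserialize[T](ByteBuffer.wrap(bytes))
}
   ```

   Base64-encoding the resulting bytes, as in the `MergeIntoHoodieTableCommand` 
hunk quoted earlier in this thread, then yields a string-safe payload.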



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1413110728

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14849)
 
   * eba58cdddcd5a83b5843bd8da41ba43b45435210 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14858)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413110694

   
   ## CI report:
   
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14856)
 
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * cab28490849cfc1288af59b77bd8986b58346782 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14857)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5683) Package the Flink release bundle jar with hive profile

2023-02-01 Thread Danny Chen (Jira)
Danny Chen created HUDI-5683:


 Summary: Package the Flink release bundle jar with hive profile
 Key: HUDI-5683
 URL: https://issues.apache.org/jira/browse/HUDI-5683
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.13.1






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1413107506

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14849)
 
   * eba58cdddcd5a83b5843bd8da41ba43b45435210 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413107450

   
   ## CI report:
   
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14856)
 
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   * cab28490849cfc1288af59b77bd8986b58346782 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7818: [HUDI-5678] Fix `deduceShuffleParallelism` in row-writing Bulk Insert helper

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7818:
URL: https://github.com/apache/hudi/pull/7818#issuecomment-1413100962

   
   ## CI report:
   
   * a8d6ee126a834ef2133ba0cbe56898ce98e0cb43 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14854)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5682) Bucket index does not work correctly for multi-writer scenarios

2023-02-01 Thread Danny Chen (Jira)
Danny Chen created HUDI-5682:


 Summary: Bucket index does not work correctly for multi-writer 
scenarios
 Key: HUDI-5682
 URL: https://issues.apache.org/jira/browse/HUDI-5682
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.13.1






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] Leoyzen opened a new issue, #7823: [SUPPORT]No execution calls after rollback compaction while using offline flink compactor.

2023-02-01 Thread via GitHub


Leoyzen opened a new issue, #7823:
URL: https://github.com/apache/hudi/issues/7823

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   While using the offline Flink compactor, a "no execute() calls" exception 
occurs after rolling back the previous inflight compaction.
   
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. There are compaction plans scheduled, and some tasks are unfinished (INFLIGHT).
   2. Launch HoodieFlinkCompactor.
   3. The error occurs
   
   **Expected behavior**
   
   After rolling back the inflight compaction, the task should be executed.
   
   **Environment Description**
   
   * Hudi version : 0.13.0-rc1
   
   * Spark version :
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.1.3
   
   * Storage (HDFS/S3/GCS..) : OSS
   
   * Running on Docker? (yes/no) : yes, HA Cluster.
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```LOG
   2023-02-01 22:53:12,742 WARN  
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rollback 
finished without deleting inflight instant file. 
Instant=[==>20230201175930385__compaction__INFLIGHT]
   2023-02-01 22:53:12,743 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Checking for 
file exists 
?oss://dengine-lake-zjk/cloudcode_prod/dwd_egc_adv_resp_intra/.hoodie/20230201225312338.rollback.inflight
   2023-02-01 22:53:12,764 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Create new 
file for toInstant 
?oss://dengine-lake-zjk/cloudcode_prod/dwd_egc_adv_resp_intra/.hoodie/20230201225312338.rollback
   2023-02-01 22:53:12,765 INFO  
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rollback 
of Commits [20230201175930385] is complete
   2023-02-01 22:53:12,772 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Deleting 
instant [==>20230201175930385__compaction__INFLIGHT]
   2023-02-01 22:53:12,787 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Removed 
instant [==>20230201175930385__compaction__INFLIGHT]
   2023-02-01 22:53:12,957 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded 
instants upto : Option{val=[20230201225312338__rollback__COMPLETED]}
   2023-02-01 22:53:12,957 INFO  org.apache.hudi.client.RunsTableService
  [] - Rollback inflight compaction instant: [20230201175424452]
   2023-02-01 22:53:13,160 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded 
instants upto : Option{val=[==>20230201225312957__rollback__REQUESTED]}
   2023-02-01 22:53:13,160 INFO  
org.apache.hudi.table.action.rollback.BaseRollbackPlanActionExecutor [] - 
Requesting Rollback with instant time 
[==>20230201225312957__rollback__REQUESTED]
   2023-02-01 22:53:13,322 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded 
instants upto : Option{val=[==>20230201225312957__rollback__REQUESTED]}
   2023-02-01 22:53:13,336 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Checking for 
file exists 
?oss://dengine-lake-zjk/cloudcode_prod/dwd_egc_adv_resp_intra/.hoodie/20230201225312957.rollback.requested
   2023-02-01 22:53:13,360 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Create new 
file for toInstant 
?oss://dengine-lake-zjk/cloudcode_prod/dwd_egc_adv_resp_intra/.hoodie/20230201225312957.rollback.inflight
   2023-02-01 22:53:13,360 INFO  
org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor [] - 
Rolling back instant [==>20230201175424452__compaction__INFLIGHT]
   2023-02-01 22:53:13,360 INFO  
org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor [] - 
Unpublished [==>20230201175424452__compaction__INFLIGHT]
   2023-02-01 22:53:13,360 INFO  
org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor [] - 
Time(in ms) taken to finish rollback 0
   2023-02-01 22:53:13,360 INFO  
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rolled 
back inflight instant 20230201175424452
   2023-02-01 22:53:13,360 WARN  
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rollback 
finished without deleting inflight instant file. 
Instant=[==>20230201175424452__compaction__INFLIGHT]
   2023-02-01 22:53:13,361 INFO  
org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Checking for 
file exists 
?oss://dengine-lake-zjk/cloudcode_prod/dwd_egc_adv_resp_intra/.hoodie/20230201225312957.rollback.inflight
   2023-02-01 22:53:13,383 INFO  ...
   ```

[GitHub] [hudi] codope commented on pull request #7796: [HUDI-5647] Automate savepoint and restore tests

2023-02-01 Thread via GitHub


codope commented on PR #7796:
URL: https://github.com/apache/hudi/pull/7796#issuecomment-1413089788

   @danny0405 Can you please look into the test failure?
   ```
   - Test Call run_clustering Procedure By Table *** FAILED ***
 Expected Array([20230130151128302,3,REQUESTED,*], 
[20230130151131957,1,REQUESTED,*]), but got 
Array([20230130151128302,3,REQUESTED,*]) (HoodieSparkSqlTestBase.scala:106)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 opened a new issue, #7822: [SUPPORT][CDC]UnresolvedUnionException: Not in union ["null","double"]: 20230202105806923_0_1

2023-02-01 Thread via GitHub


eric9204 opened a new issue, #7822:
URL: https://github.com/apache/hudi/issues/7822

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   With CDC enabled, the compaction table service cannot be performed. 
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Set `hoodie.table.cdc.enabled=true` and 
`hoodie.table.cdc.supplemental.logging.mode=data_before_after`.
   2. Use table type MOR (see the sketch after this list).
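
   A minimal Scala sketch of a write carrying the configuration above; the 
input DataFrame and key fields are hypothetical, the table path is taken from 
the stacktrace below, and standard Hudi writer options are assumed:

   ```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("cdc-repro").getOrCreate()
import spark.implicits._

// Hypothetical input; any (id, ts, part) rows will do.
val df = Seq((1, 1000L, "p1"), (2, 2000L, "p1")).toDF("id", "ts", "part")

df.write.format("hudi")
  .option("hoodie.table.name", "cdc_test")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "part")
  .option("hoodie.table.cdc.enabled", "true")
  .option("hoodie.table.cdc.supplemental.logging.mode", "data_before_after")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/cdc_test")
   ```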
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : none
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```
   23/02/02 10:58:21 ERROR HoodieStreamingSink: Micro batch id=1 threw 
following expections,aborting streaming app to avoid data loss: 
   org.apache.hudi.exception.HoodieCompactionException: Could not compact 
/tmp/hudi/cdc_test
at 
org.apache.hudi.table.action.compact.RunCompactionActionExecutor.execute(RunCompactionActionExecutor.java:116)
at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.compact(HoodieSparkMergeOnReadTable.java:140)
at 
org.apache.hudi.client.SparkRDDTableServiceClient.compact(SparkRDDTableServiceClient.java:75)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$runAnyPendingCompactions$2(BaseHoodieTableServiceClient.java:191)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.runAnyPendingCompactions(BaseHoodieTableServiceClient.java:189)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.inlineCompaction(BaseHoodieTableServiceClient.java:160)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.runTableServicesInline(BaseHoodieTableServiceClient.java:334)
at 
org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:540)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:249)
at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:102)
at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:903)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:372)
at 
org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:122)
at scala.util.Try$.apply(Try.scala:213)
at 
org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:120)
at 
org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:244)
at 
org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:119)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:584)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:584)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at ...
   ```
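
   For reference, the exception in the title can be reproduced directly against 
Avro's union resolution; a minimal Scala sketch using only Avro's public API 
(the string value is taken from the issue title):

   ```scala
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.GenericData

object UnionSketch extends App {
  // The union from the issue title: ["null","double"]
  val union = SchemaBuilder.unionOf().nullType().and().doubleType().endUnion()

  // Fine: returns the index of the "double" branch.
  GenericData.get().resolveUnion(union, 1.0d)

  // Throws org.apache.avro.UnresolvedUnionException:
  //   Not in union ["null","double"]: 20230202105806923_0_1
  GenericData.get().resolveUnion(union, "20230202105806923_0_1")
}
   ```

   That is, a string (commit-time-like) value is being resolved against a 
`["null","double"]` field, which suggests a non-double value is being pushed 
into a double-typed ordering field during compaction.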

[GitHub] [hudi] KnightChess closed pull request #7119: [HUDI-5149] fix spark single file sort plan can not work

2023-02-01 Thread via GitHub


KnightChess closed pull request #7119: [HUDI-5149] fix spark single file sort 
plan can not work
URL: https://github.com/apache/hudi/pull/7119


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhmingyong commented on issue #6021: [SUPPORT] flink write rollback error, [Cannot use marker based rollback strategy on completed instant ]

2023-02-01 Thread via GitHub


zhmingyong commented on issue #6021:
URL: https://github.com/apache/hudi/issues/6021#issuecomment-1413075646

   > How did you solve the problem? I also encountered the same problem.
   
   Caused by: java.lang.IllegalArgumentException: Cannot use marker based 
rollback strategy on completed instant:[20230103022823207__commit__COMPLETED]
   at 
org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
   at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.<init>(BaseRollbackActionExecutor.java:90)
   at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.<init>(BaseRollbackActionExecutor.java:71)
   at 
org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor.<init>(CopyOnWriteRollbackActionExecutor.java:48)
   at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.rollback(HoodieSparkCopyOnWriteTable.java:343)
   at 
org.apache.hudi.client.AbstractHoodieWriteClient.rollback(AbstractHoodieWriteClient.java:640)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhmingyong commented on issue #6021: [SUPPORT] flink write rollback error, [Cannot use marker based rollback strategy on completed instant ]

2023-02-01 Thread via GitHub


zhmingyong commented on issue #6021:
URL: https://github.com/apache/hudi/issues/6021#issuecomment-1413075020

   How did you solve the problem? I also encountered the same problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413055790

   
   ## CI report:
   
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14856)
 
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   * a7104faad440c94bfae085857cd583ade8fd8e46 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413050960

   
   ## CI report:
   
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14856)
 
   * 49ddad424eff8fc009fc3f698d9bce7de3d5ccbe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #7778: [SUPPORT] NPE in spark structured streaming reading hudi table from checkpoint that no longer exists

2023-02-01 Thread via GitHub


nsivabalan commented on issue #7778:
URL: https://github.com/apache/hudi/issues/7778#issuecomment-1413050154

   thanks @kazdy. will leave it open. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1413050063

   
   ## CI report:
   
   * 15ecd91180d32c7fa1905c11408f4bc23347e682 UNKNOWN
   * f144027d86cb2fad74a0a4a175e27204dacec8d3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14855)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7821:
URL: https://github.com/apache/hudi/pull/7821#issuecomment-1413044543

   
   ## CI report:
   
   * e99119a0314f601d87668aeac8b048730415c919 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14853)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7813: [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7813:
URL: https://github.com/apache/hudi/pull/7813#issuecomment-1413044467

   
   ## CI report:
   
   * bd427884d0d57f86eeb0260a5bc0f606fb72cb19 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14834)
 
   * 4f2eef73eae310a70c0b3c4f142c98808e6e8030 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] slfan1989 commented on pull request #7809: [HUDI-5664] Improve SqlQueryPreCommitValidator#queries Parallelism.

2023-02-01 Thread via GitHub


slfan1989 commented on PR #7809:
URL: https://github.com/apache/hudi/pull/7809#issuecomment-1413038747

   @codope Can you help review this pr? Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5561) The preCombine method of PartialUpdateAvroPayload is not called

2023-02-01 Thread xi chaomin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xi chaomin closed HUDI-5561.

Resolution: Duplicate

> The preCombine method of PartialUpdateAvroPayload is not called
> ---
>
> Key: HUDI-5561
> URL: https://issues.apache.org/jira/browse/HUDI-5561
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xicm closed pull request #7675: [HUDI-5561] The preCombine method of PartialUpdateAvroPayload is not called

2023-02-01 Thread via GitHub


xicm closed pull request #7675: [HUDI-5561] The preCombine method of 
PartialUpdateAvroPayload is not called
URL: https://github.com/apache/hudi/pull/7675


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


yihua commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1093906025


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -328,7 +328,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
   }).toMap
 // Serialize the Map[UpdateCondition, UpdateAssignments] to base64 string
 val serializedUpdateConditionAndExpressions = Base64.getEncoder
-  .encodeToString(SerDeUtils.toBytes(updateConditionToAssignments))
+  .encodeToString(Serializer.toBytes(updateConditionToAssignments))

Review Comment:
   Does this work for all Spark versions?



##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -455,5 +456,50 @@ object ExpressionPayload {
 field.schema, field.doc, field.defaultVal, field.order))
 Schema.createRecord(a.getName, a.getDoc, a.getNamespace, a.isError, 
mergedFields.asJava)
   }
+
+
+  /**
+   * This object differs from Hudi's generic [[SerializationUtils]] in its 
ability to serialize
+   * Spark's internal structures (various [[Expression]]s)
+   *
+   * For that purpose we re-use Spark's [[KryoSerializer]] instance sharing 
configuration
+   * with enclosing [[SparkEnv]]. This is necessary to make sure that this 
particular instance of Kryo
+   * used for serialization of Spark's internal structures (like 
[[Expression]]s) is configured
+   * appropriately (class-loading, custom serializers, etc)
+   *
+   * TODO rebase on Spark's SerializerSupport
+   */
+  private[hudi] object Serializer {
+

Review Comment:
   Have you tested this on all Spark versions (Spark 2.4, 3.1, 3.2, 3.3) in a 
cluster environment (multiple nodes)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7818: [HUDI-5678] deduceShuffleParallelism Returns 0

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7818:
URL: https://github.com/apache/hudi/pull/7818#issuecomment-1412983119

   
   ## CI report:
   
   * 711e13f7eb54f36c755e6a57966c228df161bd0c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14850)
 
   * a8d6ee126a834ef2133ba0cbe56898ce98e0cb43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14854)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1412982086

   
   ## CI report:
   
   * 15ecd91180d32c7fa1905c11408f4bc23347e682 UNKNOWN
   * f144027d86cb2fad74a0a4a175e27204dacec8d3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7818: [HUDI-5678] deduceShuffleParallelism Returns 0

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7818:
URL: https://github.com/apache/hudi/pull/7818#issuecomment-1412977246

   
   ## CI report:
   
   * 711e13f7eb54f36c755e6a57966c228df161bd0c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14850)
 
   * a8d6ee126a834ef2133ba0cbe56898ce98e0cb43 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1412961868

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14849)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rahil-c commented on a diff in pull request #7819: [HUDI-5652] Add hudi-cli-bundle docs

2023-02-01 Thread via GitHub


rahil-c commented on code in PR #7819:
URL: https://github.com/apache/hudi/pull/7819#discussion_r1093844152


##
website/docs/cli.md:
##
@@ -5,10 +5,22 @@ last_modified_at: 2021-08-18T15:59:57-04:00
 ---
 
 ### Local set up
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the 
`basePath` and
+Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
+
+Optionally in release `0.13.0` we have now added another way of launching the 
`hudi cli`, which is using the `hudi-cli-bundle`.
+There are a couple of requirements when using this approach such as having 
`spark` installed locally on your machine. 
+It is required to use a spark distribution with hadoop dependencies packaged 
such as `spark-3.3.1-bin-hadoop2.tgz` from 
https://archive.apache.org/dist/spark/.
+We also recommend you set an env variable `$SPARK_HOME` to the path of where 
spark is installed on your machine. 
+One important thing to note is that the `hudi-spark-bundle` should also be 
present when using the `hudi-cli-bundle`.  

Review Comment:
   Good question. Ideally the cli bundle and spark bundle paths should be 
inferred by the logic of the script. 
https://github.com/apache/hudi/blob/master/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh#L23
   
   Users can also set these env vars themselves in their shell, but this is 
not required. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rahil-c commented on a diff in pull request #7819: [HUDI-5652] Add hudi-cli-bundle docs

2023-02-01 Thread via GitHub


rahil-c commented on code in PR #7819:
URL: https://github.com/apache/hudi/pull/7819#discussion_r1093844152


##
website/docs/cli.md:
##
@@ -5,10 +5,22 @@ last_modified_at: 2021-08-18T15:59:57-04:00
 ---
 
 ### Local set up
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the 
`basePath` and
+Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
+
+Optionally in release `0.13.0` we have now added another way of launching the 
`hudi cli`, which is using the `hudi-cli-bundle`.
+There are a couple of requirements when using this approach such as having 
`spark` installed locally on your machine. 
+It is required to use a spark distribution with hadoop dependencies packaged 
such as `spark-3.3.1-bin-hadoop2.tgz` from 
https://archive.apache.org/dist/spark/.
+We also recommend you set an env variable `$SPARK_HOME` to the path of where 
spark is installed on your machine. 
+One important thing to note is that the `hudi-spark-bundle` should also be 
present when using the `hudi-cli-bundle`.  

Review Comment:
   Good question. Ideally the cli bundle and spark bundle paths should be 
inferred by the logic of the script. 
https://github.com/apache/hudi/blob/master/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh#L23
   
   Users can also set these env vars themselves in their shell. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7821:
URL: https://github.com/apache/hudi/pull/7821#issuecomment-1412892325

   
   ## CI report:
   
   * e99119a0314f601d87668aeac8b048730415c919 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14853)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1412877741

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14849)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7821:
URL: https://github.com/apache/hudi/pull/7821#issuecomment-1412877824

   
   ## CI report:
   
   * e99119a0314f601d87668aeac8b048730415c919 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7819: [HUDI-5652] Add hudi-cli-bundle docs

2023-02-01 Thread via GitHub


nsivabalan commented on code in PR #7819:
URL: https://github.com/apache/hudi/pull/7819#discussion_r1093831535


##
website/docs/cli.md:
##
@@ -5,10 +5,22 @@ last_modified_at: 2021-08-18T15:59:57-04:00
 ---
 
 ### Local set up
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the 
`basePath` and
+Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
+
+Optionally in release `0.13.0` we have now added another way of launching the 
`hudi cli`, which is using the `hudi-cli-bundle`.
+There are a couple of requirements when using this approach such as having 
`spark` installed locally on your machine. 
+It is required to use a spark distribution with hadoop dependencies packaged 
such as `spark-3.3.1-bin-hadoop2.tgz` from 
https://archive.apache.org/dist/spark/.
+We also recommend you set an env variable `$SPARK_HOME` to the path of where 
spark is installed on your machine. 
+One important thing to note is that the `hudi-spark-bundle` should also be 
present when using the `hudi-cli-bundle`.  

Review Comment:
   How do we set the path for the spark-bundle? Is there any other env 
variable that one needs to set? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1412863583

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5681:
-
Labels: pull-request-available  (was: )

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> While running our benchmark suite against 0.13 RC, we've stumbled upon 
> the following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> 

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7821: [HUDI-5681] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1093798048


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -188,7 +188,6 @@ trait ProvidesHoodieConfig extends Logging {
 PRECOMBINE_FIELD.key -> preCombineField,
 PARTITIONPATH_FIELD.key -> partitionFieldsStr,
 PAYLOAD_CLASS_NAME.key -> payloadClassName,
-HoodieWriteConfig.COMBINE_BEFORE_INSERT.key -> 
String.valueOf(hasPrecombineColumn),

Review Comment:
   Stacked on top of another (for testing), will be cleaned up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Status: In Progress  (was: Open)

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon the 
> following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> 

[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Status: Patch Available  (was: In Progress)

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon the 
> following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> 

[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3  
(was: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 
0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 
2023-01-31)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This is proving to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373
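
A hedged sketch of the flag in question, for context: the trailing `reuse` argument decides
whether the underlying MT log scanner and HFile readers stay open across lookups. The
constructor shape is assumed from the 0.12.x code base; this is illustrative only, not the
eventual fix.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.config.HoodieMetadataConfig
import org.apache.hudi.common.engine.HoodieLocalEngineContext
import org.apache.hudi.metadata.HoodieBackedTableMetadata

// "/tmp/hudi_table" and "/tmp/spillable" are hypothetical paths.
val ctx = new HoodieLocalEngineContext(new Configuration())
val metadataConfig = HoodieMetadataConfig.newBuilder().enable(true).build()
// `reuse = true` keeps the underlying MT readers open across lookups.
val metadata = new HoodieBackedTableMetadata(
  ctx, metadataConfig, "/tmp/hudi_table", "/tmp/spillable", /* reuse = */ true)
```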





[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Status: Patch Available  (was: In Progress)

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
>   forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>     val sortModeName = sortMode.name()
>     test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>       withTempDir { basePath =>
>         testBulkInsertPartitioner(basePath, sortModeName)
>       }
>     }
>   }
>
>   def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>     val tableName = generateTableName
>     // Remove these with [HUDI-5419]
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>     spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>     // Default parallelism is 200, which means in global sort each record will end up in a different spark partition so
>     // 9 files would be created. Setting parallelism to 3 so that each spark partition will contain a hudi partition.
>     val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>       "hoodie.bulkinsert.shuffle.parallelism = 3,"
>     } else {
>       ""
>     }
>     spark.sql(
>       s"""
>          |create table $tableName (
>          |  id int,
>          |  name string,
>          |  price double,
>          |  dt string
>          |) using hudi
>          | tblproperties (
>          |  primaryKey = 'id',
>          |  preCombineField = 'name',
>          |  type = 'cow',
>          |  $parallelism
>          |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>          | )
>          | partitioned by (dt)
>          | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>     spark.sql("set hoodie.sql.bulk.insert.enable = true")
>     spark.sql("set hoodie.sql.insert.mode = non-strict")
>     spark.sql(
>       s"""insert into $tableName values
>          |(5, 'a', 35, '2021-05-21'),
>          |(1, 'a', 31, '2021-01-21'),
>          |(3, 'a', 33, '2021-03-21'),
>          |(4, 'b', 16, '2021-05-21'),
>          |(2, 'b', 18, '2021-01-21'),
>          |(6, 'b', 17, '2021-03-21'),
>          |(8, 'a', 21, '2021-05-21'),
>          |(9, 'a', 22, '2021-01-21'),
>          |(7, 'a', 23, '2021-03-21')
>          |""".stripMargin)
>     assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
>   }
> {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at 
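
The quoted trace shows `coalesce` being handed a partition count of 0 out of
`deduceShuffleParallelism`. A minimal sketch of the invariant the fix needs (the method
shape and the fallback are assumptions, not the actual Hudi patch):

```scala
import org.apache.spark.sql.DataFrame

// Whatever parallelism is deduced, Repartition/Coalesce must receive a
// strictly positive count.
def deduceShuffleParallelism(df: DataFrame, configuredParallelism: Int): Int = {
  val deduced =
    if (configuredParallelism > 0) configuredParallelism
    else df.rdd.getNumPartitions // fall back to the incoming partitioning
  math.max(deduced, 1) // never 0: "Number of partitions (0) must be positive"
}
```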

[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-02-14  (was: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This is proving to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373





[jira] [Closed] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5633.
-
Resolution: Fixed

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> There are currently the following issues w/ the HoodieSparkRecord 
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
> which do schema traversals for every record. Instead, we should do the schema 
> traversal only once and produce a transformer that directly creates the new 
> record from the old one.
>  # Records are currently copied for every Executor, even for the Simple one, 
> which is not actually buffering any records and therefore doesn't require 
> records to be copied.





[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Status: In Progress  (was: Open)

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
>   forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>     val sortModeName = sortMode.name()
>     test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>       withTempDir { basePath =>
>         testBulkInsertPartitioner(basePath, sortModeName)
>       }
>     }
>   }
>
>   def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>     val tableName = generateTableName
>     // Remove these with [HUDI-5419]
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>     spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>     spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>     // Default parallelism is 200, which means in global sort each record will end up in a different spark partition so
>     // 9 files would be created. Setting parallelism to 3 so that each spark partition will contain a hudi partition.
>     val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>       "hoodie.bulkinsert.shuffle.parallelism = 3,"
>     } else {
>       ""
>     }
>     spark.sql(
>       s"""
>          |create table $tableName (
>          |  id int,
>          |  name string,
>          |  price double,
>          |  dt string
>          |) using hudi
>          | tblproperties (
>          |  primaryKey = 'id',
>          |  preCombineField = 'name',
>          |  type = 'cow',
>          |  $parallelism
>          |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>          | )
>          | partitioned by (dt)
>          | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>     spark.sql("set hoodie.sql.bulk.insert.enable = true")
>     spark.sql("set hoodie.sql.insert.mode = non-strict")
>     spark.sql(
>       s"""insert into $tableName values
>          |(5, 'a', 35, '2021-05-21'),
>          |(1, 'a', 31, '2021-01-21'),
>          |(3, 'a', 33, '2021-03-21'),
>          |(4, 'b', 16, '2021-05-21'),
>          |(2, 'b', 18, '2021-01-21'),
>          |(6, 'b', 17, '2021-03-21'),
>          |(8, 'a', 21, '2021-05-21'),
>          |(9, 'a', 22, '2021-01-21'),
>          |(7, 'a', 23, '2021-03-21')
>          |""".stripMargin)
>     assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
>   }
> {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at 

[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Sprint: Sprint 2023-01-31

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon the 
> following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> 

[jira] [Created] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5681:
-

 Summary: Merge Into fails while deserializing expressions
 Key: HUDI-5681
 URL: https://issues.apache.org/jira/browse/HUDI-5681
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


While running our benchmark suite against the 0.13 RC, we've stumbled upon the 
following exceptions:
{code:java}
23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
aborting job
2023-02-01T08:29:01.219 ERROR: merge:1:inventory
Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
(ip-172-31-18-9.us-west-2.compute.internal executor 140): 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :1
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
org.apache.spark.sql.catalyst.expressions.Literal
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
at 
org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
at 
org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:419)
at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2405)
at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
at 

[GitHub] [hudi] vinothchandar commented on a diff in pull request #7821: [MINOR] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


vinothchandar commented on code in PR #7821:
URL: https://github.com/apache/hudi/pull/7821#discussion_r1093790583


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -188,7 +188,6 @@ trait ProvidesHoodieConfig extends Logging {
 PRECOMBINE_FIELD.key -> preCombineField,
 PARTITIONPATH_FIELD.key -> partitionFieldsStr,
 PAYLOAD_CLASS_NAME.key -> payloadClassName,
-HoodieWriteConfig.COMBINE_BEFORE_INSERT.key -> 
String.valueOf(hasPrecombineColumn),

Review Comment:
   why do we change this file for this PR?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala:
##
@@ -18,18 +18,28 @@
 package org.apache.spark.sql.hudi
 
 import org.apache.hudi.common.util.BinaryUtil
-import org.apache.spark.SparkConf
+import org.apache.spark.internal.config.Kryo.KRYO_USE_POOL
+import org.apache.spark.{SparkConf, SparkEnv}
 import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}
 
 import java.nio.ByteBuffer
 
 
+// TODO merge w/ SerializationUtils
 object SerDeUtils {
 
-  private val SERIALIZER_THREAD_LOCAL = new ThreadLocal[SerializerInstance] {
+  private lazy val conf = {
+val conf = Option(SparkEnv.get)
+  // TODO elaborate

Review Comment:
   fix comment
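
   For context, a minimal sketch of the direction visible in the hunk above: resolve the
   live `SparkConf` from `SparkEnv` so Kryo is instantiated with the application's actual
   configuration, and fall back to a default conf only when no `SparkEnv` exists (e.g. in
   unit tests). A sketch under those assumptions, not the exact patch:

   ```scala
   import org.apache.spark.{SparkConf, SparkEnv}
   import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}

   object SerDeUtilsSketch {
     // Prefer the conf of the running Spark application over a fresh SparkConf.
     private lazy val conf: SparkConf =
       Option(SparkEnv.get).map(_.conf).getOrElse(new SparkConf)

     private val serializerLocal = new ThreadLocal[SerializerInstance] {
       override def initialValue(): SerializerInstance =
         new KryoSerializer(conf).newInstance()
     }

     def serializerInstance: SerializerInstance = serializerLocal.get()
   }
   ```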






[GitHub] [hudi] alexeykudinkin opened a new pull request, #7821: [MINOR] Fixing Kryo being instantiated w/ invalid `SparkConf`

2023-02-01 Thread via GitHub


alexeykudinkin opened a new pull request, #7821:
URL: https://github.com/apache/hudi/pull/7821

   ### Change Logs
   
   TBA
   
   ### Impact
   
   TBA
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #7818: [HUDI-5678] deduceShuffleParallelism Returns 0

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7818:
URL: https://github.com/apache/hudi/pull/7818#issuecomment-1412757523

   
   ## CI report:
   
   * 711e13f7eb54f36c755e6a57966c228df161bd0c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14850)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7816: [HUDI-5676] Fix BigQuerySyncTool standalone mode

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7816:
URL: https://github.com/apache/hudi/pull/7816#issuecomment-1412757434

   
   ## CI report:
   
   * 838d7b43b55595c23b9e71b4abea0c40215fd7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14849)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7359: [HUDI-3304] WIP - Allow selective partial update

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7359:
URL: https://github.com/apache/hudi/pull/7359#issuecomment-1412755725

   
   ## CI report:
   
   * 7f3578b831243a80ca4a2d79c9e4ff2ebd52e563 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14851)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch master updated: [MINOR] Restoring existing behavior for `DeltaStreamer` Incremental Source (#7810)

2023-02-01 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7064c380506 [MINOR] Restoring existing behavior for `DeltaStreamer` 
Incremental Source (#7810)
7064c380506 is described below

commit 7064c380506814964dd85773e2ee7b7f187b88c3
Author: Alexey Kudinkin 
AuthorDate: Wed Feb 1 11:19:45 2023 -0800

[MINOR] Restoring existing behavior for `DeltaStreamer` Incremental Source 
(#7810)

This is restoring existing behavior for DeltaStreamer Incremental Source, 
as the change in #7769 removed _hoodie_partition_path field from the dataset 
making it impossible to be accessed from the DS Transformers for ex
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  2 +-
 .../apache/hudi/common/config/HoodieConfig.java|  8 
 .../org/apache/hudi/utilities/UtilHelpers.java | 13 
 .../hudi/utilities/deltastreamer/DeltaSync.java|  8 ++--
 .../hudi/utilities/sources/HoodieIncrSource.java   | 23 --
 5 files changed, 37 insertions(+), 17 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index e6525a2b1dc..f56defe7eac 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -1034,7 +1034,7 @@ public class HoodieWriteConfig extends HoodieConfig {
   }
 
   public HoodieRecordMerger getRecordMerger() {
-    List<String> mergers = getSplitStringsOrDefault(RECORD_MERGER_IMPLS).stream()
+    List<String> mergers = StringUtils.split(getStringOrDefault(RECORD_MERGER_IMPLS), ",").stream()
         .map(String::trim)
         .distinct()
         .collect(Collectors.toList());
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java 
b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
index a48e4202bf9..223b93e5744 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
@@ -142,14 +142,6 @@ public class HoodieConfig implements Serializable {
 return StringUtils.split(getString(configProperty), delimiter);
   }
 
-  public <T> List<String> getSplitStringsOrDefault(ConfigProperty<T> configProperty) {
-    return getSplitStringsOrDefault(configProperty, ",");
-  }
-
-  public <T> List<String> getSplitStringsOrDefault(ConfigProperty<T> configProperty, String delimiter) {
-    return StringUtils.split(getStringOrDefault(configProperty), delimiter);
-  }
-
   public String getString(String key) {
 return props.getProperty(key);
   }
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
index d159fee0be4..45a9750c3b3 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
@@ -29,12 +29,16 @@ import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
 import org.apache.hudi.common.config.DFSPropertiesConfiguration;
 import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.EngineType;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRecordMerger;
 import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.util.ConfigUtils;
+import org.apache.hudi.common.util.HoodieRecordUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.StringUtils;
@@ -109,6 +113,15 @@ public class UtilHelpers {
 
   private static final Logger LOG = LogManager.getLogger(UtilHelpers.class);
 
+  public static HoodieRecordMerger createRecordMerger(Properties props) {
+    List<String> recordMergerImplClasses = ConfigUtils.split2List(
+        props.getProperty(HoodieWriteConfig.RECORD_MERGER_IMPLS.key(),
+            HoodieWriteConfig.RECORD_MERGER_IMPLS.defaultValue()));
+    HoodieRecordMerger recordMerger = HoodieRecordUtils.createRecordMerger(null, EngineType.SPARK,
+        recordMergerImplClasses,
+        props.getProperty(HoodieWriteConfig.RECORD_MERGER_STRATEGY.key(),
+            HoodieWriteConfig.RECORD_MERGER_STRATEGY.defaultValue()));
+
+    return recordMerger;
+  }
+
   public static Source createSource(String sourceClass, 
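
A hedged usage sketch of the `createRecordMerger` helper added above (the diff is truncated
in the archive). Shown in Scala for consistency with the other sketches in this digest; the
single merger impl listed is the stock Avro merger, an illustrative value:

```scala
import java.util.Properties

import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.utilities.UtilHelpers

val props = new Properties()
props.setProperty(
  HoodieWriteConfig.RECORD_MERGER_IMPLS.key(),
  "org.apache.hudi.common.model.HoodieAvroRecordMerger")
// Builds a Spark-engine merger from the comma-separated impl list and strategy.
val merger = UtilHelpers.createRecordMerger(props)
```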

[GitHub] [hudi] alexeykudinkin merged pull request #7810: [MINOR] Restoring existing behavior for `DeltaStreamer` Incremental Source

2023-02-01 Thread via GitHub


alexeykudinkin merged PR #7810:
URL: https://github.com/apache/hudi/pull/7810





[GitHub] [hudi] alexeykudinkin commented on pull request #7810: [MINOR] Restoring existing behavior for `DeltaStreamer` Incremental Source

2023-02-01 Thread via GitHub


alexeykudinkin commented on PR #7810:
URL: https://github.com/apache/hudi/pull/7810#issuecomment-1412590354

   CI is green:
   
   https://user-images.githubusercontent.com/428277/216141593-26de921f-5846-4387-a04d-699e2e66d546.png
   
   
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=14839&view=results





[GitHub] [hudi] imuntyan opened a new issue, #7820: [SUPPORT] Errors are thrown when upserting a record with cleaner service enabled

2023-02-01 Thread via GitHub


imuntyan opened a new issue, #7820:
URL: https://github.com/apache/hudi/issues/7820

   Enabling the hudi cleaner service (sync or async) throws an error when 
trying to upsert a record in spark append mode.
   
   **To Reproduce**
   
   I am running the following script in the Jupyter notebook:
   ```python
   from numpy import random
   
   hudi_mode, spark_mode = 'upsert', 'append'
   insert_hudi_path = "s3://[redacted]/hudi/test01"
   
   data = [
       {"id": f'id-{random.randint(10)}', "text": f'text-{random.randint(10)}'}
   ]

   df = spark.createDataFrame(data)
   
   table_name = 'table_test01'
   primary_key="id"
   precombine="text"
   hudi_options = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.operation': hudi_mode,
   'hoodie.datasource.write.recordkey.field': primary_key,
   'hoodie.datasource.write.precombine.field': precombine,
   'hoodie.metadata.enable': True,
   'hoodie.clean.automatic': True,
   # 'hoodie.clean.async': True,
   'hoodie.cleaner.commits.retained': 1,
   'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
   }
   
   df.write.format('hudi') \
   .options(**hudi_options) \
   .mode(spark_mode) \
   .save(insert_hudi_path)
   ```
   When the S3 location is empty this script executes fine and creates the data 
in S3. When executing it again, it throws the errors attached below. The errors 
are thrown for both sync and async cleaner mode (the async mode throws the 
errors on the third run though).
   
   The errors are not returned when the following configuration is commented 
out:
   ```
   #'hoodie.clean.automatic': True,
   #'hoodie.clean.async': True,
   #'hoodie.cleaner.commits.retained': 1,
   
   ```
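
   A hedged variant of the options above, not a confirmed fix: keep the cleaner's retention
   well below the archival window instead of retaining a single commit. Shown in Scala for
   consistency with the other sketches in this digest; `df` is the same frame as in the
   report, and the numeric values are illustrative assumptions.

   ```scala
   // hoodie.cleaner.commits.retained must stay below hoodie.keep.min.commits,
   // which in turn must stay below hoodie.keep.max.commits.
   val hudiOptions = Map(
     "hoodie.table.name" -> "table_test01",
     "hoodie.datasource.write.operation" -> "upsert",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.precombine.field" -> "text",
     "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
     "hoodie.metadata.enable" -> "true",
     "hoodie.clean.automatic" -> "true",
     "hoodie.cleaner.commits.retained" -> "10",
     "hoodie.keep.min.commits" -> "20",
     "hoodie.keep.max.commits" -> "30"
   )

   df.write.format("hudi").options(hudiOptions).mode("append").save("s3://[redacted]/hudi/test01")
   ```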
   
   **Expected behavior**
   
   No errors.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0-SNAPSHOT
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * AWS EMR version: emr-6.9.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : EMR on EC2
   
   
   **Stacktrace**
   
   
[emr-errors.txt](https://github.com/apache/hudi/files/10560680/emr-errors.txt)
   
   





[GitHub] [hudi] hudi-bot commented on pull request #7359: [HUDI-3304] WIP - Allow selective partial update

2023-02-01 Thread via GitHub


hudi-bot commented on PR #7359:
URL: https://github.com/apache/hudi/pull/7359#issuecomment-1412560205

   
   ## CI report:
   
   * adce700376e9214504bdf08a43a6b345c920345c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14822)
 
   * 7f3578b831243a80ca4a2d79c9e4ff2ebd52e563 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14851)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on issue #7791: [SUPPORT] Don't see metadata folder in .hoodie folder when ingesting data use hudi kafka connector

2023-02-01 Thread via GitHub


yihua commented on issue #7791:
URL: https://github.com/apache/hudi/issues/7791#issuecomment-1412558758

   Hi @duc-dn, I created this ticket to track support for the metadata table in the 
Kafka Connect Sink Connector: https://issues.apache.org/jira/browse/HUDI-5680. 
For now, you should remove the `hoodie.metadata.enable` config from the sink 
configs.





[jira] [Updated] (HUDI-5680) Support metadata table in Kafka Connect writer

2023-02-01 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5680:

Priority: Critical  (was: Major)

> Support metadata table in Kafka Connect writer
> --
>
> Key: HUDI-5680
> URL: https://issues.apache.org/jira/browse/HUDI-5680
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Critical
>





