[jira] [Updated] (HUDI-5293) Schema on read + reconcile schema fails w/ 0.12.1

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5293:
-
Labels: pull-request-available  (was: )

> Schema on read + reconcile schema fails w/ 0.12.1
> -
>
> Key: HUDI-5293
> URL: https://issues.apache.org/jira/browse/HUDI-5293
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> if I do schema on read on commit1 and then schema on read + reconcile schema 
> for 2nd batch, it fails w/ 
> {code:java}
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> 22/11/28 16:44:26 ERROR BaseSparkCommitActionExecutor: Error upserting 
> bucketType UPDATE for partition :2
> java.lang.IllegalArgumentException: cannot modify hudi meta col: 
> _hoodie_commit_time
>   at 
> org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
>   at 
> org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
>   at 
> java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchema(AvroSchemaEvolutionUtils.java:80)
>   at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:103)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [hudi] hudi-bot commented on pull request #7323: Update HoodieBackedTableMetadata.java

2022-11-28 Thread GitBox


hudi-bot commented on PR #7323:
URL: https://github.com/apache/hudi/pull/7323#issuecomment-1330225815

   
   ## CI report:
   
   * ff7723a4143159db1873f2a453338f2b5dd88bbd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13311)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan opened a new pull request, #7324: [HUDI-5293] Fixing reader schema w/ Merge handle helper w/ schema on read enable + reconcile schema

2022-11-28 Thread GitBox


nsivabalan opened a new pull request, #7324:
URL: https://github.com/apache/hudi/pull/7324

   ### Change Logs
   
   Commit 1: write with schema.on.read enabled.
   
   Commit 2: write with schema.on.read enabled + reconcile.schema.
   
   Commit 2 fails with:
   ```
   
   java.lang.IllegalArgumentException: cannot modify hudi meta col: 
_hoodie_commit_time
at 
org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
at 
org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
at 
org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
   ```
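   For context, the stack trace shows the reconcile step attempting to add Hudi's meta columns back into the evolved schema, which trips the cannot-modify-meta-col guard. Below is a minimal Python sketch (illustrative only, not Hudi's actual AvroSchemaEvolutionUtils logic) of reconciling field lists while skipping the meta columns:

```python
# Hudi's five meta columns; schema evolution must never try to add or
# modify them, or the add-column guard raises (as in the trace above).
HOODIE_META_COLS = {
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name",
}

def reconcile_fields(table_fields, incoming_fields):
    """Append fields the table schema has but the incoming batch lacks,
    skipping meta columns so they are never treated as column adds."""
    merged = list(incoming_fields)
    present = set(incoming_fields)
    for field in table_fields:
        if field in HOODIE_META_COLS or field in present:
            continue
        merged.append(field)
    return merged

table = ["_hoodie_commit_time", "_hoodie_record_key", "id", "name", "new_col"]
batch = ["id", "name"]
print(reconcile_fields(table, batch))  # ['id', 'name', 'new_col']
```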
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-5293) Schema on read + reconcile schema fails w/ 0.12.1

2022-11-28 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-5293:
-

 Summary: Schema on read + reconcile schema fails w/ 0.12.1
 Key: HUDI-5293
 URL: https://issues.apache.org/jira/browse/HUDI-5293
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


If I enable schema on read for commit 1 and then schema on read + reconcile 
schema for the 2nd batch, it fails with:
{code:java}
warning: there was one deprecation warning; re-run with -deprecation for details
22/11/28 16:44:26 ERROR BaseSparkCommitActionExecutor: Error upserting 
bucketType UPDATE for partition :2
java.lang.IllegalArgumentException: cannot modify hudi meta col: 
_hoodie_commit_time
at 
org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
at 
org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
at 
org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
at 
java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at 
org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchema(AvroSchemaEvolutionUtils.java:80)
at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:103)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) {code}




[jira] [Updated] (HUDI-5293) Schema on read + reconcile schema fails w/ 0.12.1

2022-11-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5293:
--
Fix Version/s: 0.12.2

> Schema on read + reconcile schema fails w/ 0.12.1
> -
>
> Key: HUDI-5293
> URL: https://issues.apache.org/jira/browse/HUDI-5293
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.2
>
>
> if I do schema on read on commit1 and then schema on read + reconcile schema 
> for 2nd batch, it fails w/ 
> {code:java}
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> 22/11/28 16:44:26 ERROR BaseSparkCommitActionExecutor: Error upserting 
> bucketType UPDATE for partition :2
> java.lang.IllegalArgumentException: cannot modify hudi meta col: 
> _hoodie_commit_time
>   at 
> org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
>   at 
> org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
>   at 
> java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchema(AvroSchemaEvolutionUtils.java:80)
>   at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:103)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}





[GitHub] [hudi] hudi-bot commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


hudi-bot commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330220573

   
   ## CI report:
   
   * 4f6c23b63efd79d7c9c488e232959623c8e346f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13303)
 
   * 56aa2ec4d7955f7db505b908a14cedbc285e692c UNKNOWN
   * 25273be99968e827c30c215b6c0e3809ea0b904a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13310)
 
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #7323: Update HoodieBackedTableMetadata.java

2022-11-28 Thread GitBox


hudi-bot commented on PR #7323:
URL: https://github.com/apache/hudi/pull/7323#issuecomment-1330220608

   
   ## CI report:
   
   * ff7723a4143159db1873f2a453338f2b5dd88bbd UNKNOWN
   
   
   





[GitHub] [hudi] lufzhangzitao commented on issue #7283: [SUPPORT] Schema evolution fails due to NPE with `hoodie.schema.on.read.enable` turned on

2022-11-28 Thread GitBox


lufzhangzitao commented on issue #7283:
URL: https://github.com/apache/hudi/issues/7283#issuecomment-1330218337

   @xiarixiaoyao 
Thanks. I tried Hudi 0.12.1 and schema evolution works with 
hoodie.datasource.write.reconcile.schema.
   But the example in the Hudi documentation does not seem to work: when I 
follow it to add a new string field and change a field's datatype from int to 
long in Hudi 0.12.1, it throws the exception:
   org.apache.avro.AvroRuntimeException: cannot support rewrite value for 
schema type: "int" since the old schema type is: "long"
at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryTypeWithDiffSchemaType(HoodieAvroUtils.java:955)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryType(HoodieAvroUtils.java:869)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:820)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:818)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:772)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:734)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordDeep(HoodieAvroUtils.java:1034)
at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRdd$4(HoodieSparkUtils.scala:109)
at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRdd$5(HoodieSparkUtils.scala:117)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
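   The exception above is Avro's one-way numeric promotion rule: an int can be widened to long, but rewriting a long value back to int is rejected. A small illustrative sketch (not the real HoodieAvroUtils implementation) of such a widening-only rewrite check:

```python
# Avro's legal primitive promotions are one-way widenings; narrowing
# (e.g. long -> int) is rejected, which is exactly the error in the trace.
SAFE_PROMOTIONS = {
    ("int", "long"), ("int", "float"), ("int", "double"),
    ("long", "float"), ("long", "double"),
    ("float", "double"),
}

def rewrite_primitive(value, old_type, new_type):
    """Rewrite a value to a new primitive type, allowing only widenings."""
    if old_type == new_type or (old_type, new_type) in SAFE_PROMOTIONS:
        return value
    raise ValueError(
        f'cannot support rewrite value for schema type: "{new_type}" '
        f'since the old schema type is: "{old_type}"'
    )

print(rewrite_primitive(7, "int", "long"))   # 7 -- widening is allowed
# rewrite_primitive(7, "long", "int") would raise, matching the trace above
```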
   
   
   
![WeCom screenshot_1669707358199](https://user-images.githubusercontent.com/77084940/204467890-ef9d7c91-2c18-461f-93e6-813d7aab3052.png)
   
   
![WeCom screenshot_16697074283529](https://user-images.githubusercontent.com/77084940/204467975-7d97b32a-1494-40c2-8b1f-5d0ef49d4a05.png)
   
   
![WeCom screenshot_16697074634013](https://user-images.githubusercontent.com/77084940/204468003-2f9260b8-02ad-41b9-8f80-fd5b8f160042.png)
   
   





[GitHub] [hudi] hudi-bot commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


hudi-bot commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330215980

   
   ## CI report:
   
   * 4f6c23b63efd79d7c9c488e232959623c8e346f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13303)
 
   * 56aa2ec4d7955f7db505b908a14cedbc285e692c UNKNOWN
   * 25273be99968e827c30c215b6c0e3809ea0b904a UNKNOWN
   
   
   





[GitHub] [hudi] onlywangyh commented on issue #7298: Hudi use regular match query partition path caused invalid input path was added

2022-11-28 Thread GitBox


onlywangyh commented on issue #7298:
URL: https://github.com/apache/hudi/issues/7298#issuecomment-1330215715

   > > E.g., Table has partition 4 partitions:
   > > year=2022/month=08/day=30, year=20222/month=08/day=31, 
year=2022/month=07/day=03, year=2022/month=077/day=04
   > > Prefix "year=2022" we expect return year=2022/month=08/day=30, 
year=2022/month=07/day=03, year=2022/month=077/day=04, actual return all 
partitions, while prefix "year=2022/month=07" we expect 
year=2022/month=07/day=03, actual output only two partitions.
   > 
   > Are you saying the filter is not working this way currently?
   
   That's right, like this:
   
   Table has 2 partitions:
   year=2022/month=08/day=30
   year=2022back/month=08/day=31
   
   When querying "year=2022", we only want to get year "2022", not "2022back".
   Expected paths: _[year=2022/month=08/day=30]_.
   Actual paths: [year=2022/month=08/day=30, year=2022back/month=08/day=31]





[GitHub] [hudi] onlywangyh opened a new pull request, #7323: Update HoodieBackedTableMetadata.java

2022-11-28 Thread GitBox


onlywangyh opened a new pull request, #7323:
URL: https://github.com/apache/hudi/pull/7323

   Fix partition-path queries adding invalid input paths. The function 
getPartitionPathsWithPrefixes fetches the list of all partition paths whose 
relative partition paths match the given prefixes. It could add an unexpected 
path and throw an exception in Hive. Like this:
   
   Table has 2 partitions:
   year=2022/month=08/day=30
   year=2022back/month=08/day=31
   
   When querying "year=2022", we only want the "2022" path, not "2022back".
   Expected paths: _[year=2022/month=08/day=30]_.
   Actual paths: [year=2022/month=08/day=30, year=2022back/month=08/day=31]
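   The underlying problem is matching the prefix as a raw string, so "year=2022" also matches "year=2022back". A sketch of a boundary-aware check (illustrative; the function name and logic here are assumptions, not the actual HoodieBackedTableMetadata code):

```python
def matches_partition_prefix(partition_path, prefix):
    """True only if every component of `prefix` exactly equals the
    corresponding component of `partition_path` (no substring matches)."""
    parts = partition_path.split("/")
    prefix_parts = prefix.rstrip("/").split("/")
    return parts[:len(prefix_parts)] == prefix_parts

paths = ["year=2022/month=08/day=30", "year=2022back/month=08/day=31"]

# Naive string prefix matching exhibits the bug: both paths match.
print([p for p in paths if p.startswith("year=2022")])
# Component-boundary matching returns only the intended partition.
print([p for p in paths if matches_partition_prefix(p, "year=2022")])
```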





[GitHub] [hudi] hudi-bot commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


hudi-bot commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330204601

   
   ## CI report:
   
   * 4f6c23b63efd79d7c9c488e232959623c8e346f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13303)
 
   * 56aa2ec4d7955f7db505b908a14cedbc285e692c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] KnightChess commented on pull request #7304: [HUDI-5278]support more conf to cluster procedure

2022-11-28 Thread GitBox


KnightChess commented on PR #7304:
URL: https://github.com/apache/hudi/pull/7304#issuecomment-1330203440

   Don't merge first, I meet some question in cluster





[GitHub] [hudi] codope closed issue #7060: Error when upgrading to hudi 0.12.0 from 0.9.0

2022-11-28 Thread GitBox


codope closed issue #7060: Error when upgrading to hudi 0.12.0 from 0.9.0
URL: https://github.com/apache/hudi/issues/7060





[GitHub] [hudi] codope commented on issue #7060: Error when upgrading to hudi 0.12.0 from 0.9.0

2022-11-28 Thread GitBox


codope commented on issue #7060:
URL: https://github.com/apache/hudi/issues/7060#issuecomment-1330198915

   @navbalaraman The issue due to the partition field was fixed recently by 
https://github.com/apache/hudi/pull/7132.
   Please open a new issue if it still persists. 





[jira] [Updated] (HUDI-5181) Enhance keygen class validation

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5181:
-
Sprint: 2022/12/06  (was: 2022/11/29)

> Enhance keygen class validation
> ---
>
> Key: HUDI-5181
> URL: https://issues.apache.org/jira/browse/HUDI-5181
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.2
>
>
> Some in-code validations can be added to alert users early when they set 
> keygen configs improperly. For example, in the TimestampBased keygen, the 
> output format cannot be empty.
> We should audit all built-in keygen classes and add UTs and proper 
> validations. This improves usability and saves troubleshooting time when a 
> misconfiguration happens.
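
As an illustration of the proposed early validation (the config keys and class name below are assumptions for the sketch, not Hudi's exact ones):

```python
def validate_timestamp_keygen(config):
    """Fail fast when a TimestampBased key generator is configured
    without an output date format (hypothetical config keys)."""
    keygen = config.get("hoodie.datasource.write.keygenerator.class", "")
    if keygen.endswith("TimestampBasedKeyGenerator"):
        fmt = config.get("keygen.timebased.output.dateformat", "")
        if not fmt.strip():
            raise ValueError(
                "TimestampBased keygen requires a non-empty output date format")
    return True

cfg = {
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "keygen.timebased.output.dateformat": "yyyy/MM/dd",
}
print(validate_timestamp_keygen(cfg))  # True
```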





[hudi] branch master updated: [HUDI-5285] Exclude *-site.xml files from jar packaging (#7310)

2022-11-28 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 5b39ac1a02 [HUDI-5285] Exclude *-site.xml files from jar packaging 
(#7310)
5b39ac1a02 is described below

commit 5b39ac1a02f158b25ba39000271513203dc6bf4c
Author: Sagar Sumit 
AuthorDate: Tue Nov 29 12:54:17 2022 +0530

[HUDI-5285] Exclude *-site.xml files from jar packaging (#7310)
---
 hudi-utilities/pom.xml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hudi-utilities/pom.xml b/hudi-utilities/pom.xml
index 6af503395f..ac49ba6f9f 100644
--- a/hudi-utilities/pom.xml
+++ b/hudi-utilities/pom.xml
@@ -72,6 +72,10 @@
       </resource>
       <resource>
         <directory>src/test/resources</directory>
+        <excludes>
+          <exclude>*-site.xml</exclude>
+        </excludes>
+        <filtering>false</filtering>
       </resource>
     </resources>
 



[GitHub] [hudi] codope closed issue #7292: [SUPPORT] a hive-site.xml configuration file in the "hudi-utilities_2.12-0.12.1. jar" file, which causes the spark cluster to fail to access the external hiv

2022-11-28 Thread GitBox


codope closed issue #7292: [SUPPORT] a hive-site.xml configuration file in the 
"hudi-utilities_2.12-0.12.1. jar" file, which causes the spark cluster to fail 
to access the external hive source normally.
URL: https://github.com/apache/hudi/issues/7292





[GitHub] [hudi] codope merged pull request #7310: [HUDI-5285] Exclude *-site.xml files from jar packaging

2022-11-28 Thread GitBox


codope merged PR #7310:
URL: https://github.com/apache/hudi/pull/7310





[GitHub] [hudi] codope commented on pull request #7310: [HUDI-5285] Exclude *-site.xml files from jar packaging

2022-11-28 Thread GitBox


codope commented on PR #7310:
URL: https://github.com/apache/hudi/pull/7310#issuecomment-1330193556

   > > @xushiyan Can you please review this patch? I think we can exclude the 
test resources (especially the properties files that conflict with 
user-provided resources) from every module.
   > 
   > sounds good. make sure test resources are used strictly by test code
   
   Sure. Created HUDI-5292 to track.





[jira] [Created] (HUDI-5292) Exclude the test resources from every module packaging

2022-11-28 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-5292:
-

 Summary: Exclude the test resources from every module packaging
 Key: HUDI-5292
 URL: https://issues.apache.org/jira/browse/HUDI-5292
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 0.12.2


Exclude the test resources, especially the properties files that conflict with 
user-provided resources, from every module. This is a followup to 
https://github.com/apache/hudi/pull/7310#issuecomment-1328728297
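The fix in #7310 does this for hudi-utilities via a Maven testResources stanza; a sketch of what each module's pom.xml could carry (the *-site.xml pattern comes from that PR, the tag structure is standard Maven; extending the excludes per module is the open question of this ticket):

```xml
<build>
  <testResources>
    <testResource>
      <directory>src/test/resources</directory>
      <!-- keep environment config out of packaged artifacts so it
           cannot shadow user-provided files at runtime -->
      <excludes>
        <exclude>*-site.xml</exclude>
      </excludes>
      <!-- copy remaining test resources verbatim, without property filtering -->
      <filtering>false</filtering>
    </testResource>
  </testResources>
</build>
```

Note that excluded files are also not copied to target/test-classes, so any test that loads them from the classpath would need adjusting.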



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] JoshuaZhuCN opened a new issue, #7322: [SUPPORT][HELP] SparkSQL can not read the latest change data without execute "refresh table xxx"

2022-11-28 Thread GitBox


JoshuaZhuCN opened a new issue, #7322:
URL: https://github.com/apache/hudi/issues/7322

   SparkSQL cannot read the latest change data without executing "refresh table 
xxx" after writing the data in datasource mode
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. run spark-shell and import the required class 
   ``` import org.apache.spark.sql.SaveMode ```
   2. create the tables like this:
   ```
   spark.sql(
   s"""|CREATE TABLE IF NOT EXISTS `malawi`.`hudi_0_12_1_spark_test` (
   | `id` INT
   |,`name` STRING
   |,`age` INT
   |,`sync_time` TIMESTAMP
   |) USING HUDI
   |TBLPROPERTIES (
   | type = 'mor'
   |,primaryKey = 'id'
   |,preCombineField = 'sync_time'
   |,`hoodie.bucket.index.hash.field` = ''
   |,`hoodie.datasource.write.hive_style_partitioning` = 'false'
   |,`hoodie.table.keygenerator.class`='org.apache.hudi.keygen.ComplexKeyGenerator'
   |)
   |COMMENT 'hudi_0.12.1_test'""".stripMargin
   )
   
   spark.sql(
   s"""|create table `malawi`.`hudi_0_12_1_spark_test_rt`
   |using hudi
   |options(`hoodie.query.as.ro.table` = 'false')
   |location 'hdfs://NameNodeService1/hoodie/leqee/malawi/hudi_0_12_1_spark_test';
   |""".stripMargin
   )
   ```
   3. make test data
   ```
   var dfData = spark.sql(
   s"""|select 1 as id,'name1' as name, 18 as age, now() as sync_time 
   | union all 
   |select 2 as id,'name2' as name, 22 as age, now() as sync_time 
   | union all 
   |select 3 as id,'name3' as name, 23 as age, now() as sync_time
   |""".stripMargin
   )
   
   var dfData2 = spark.sql(
   s"""|select 4 as id,'name1' as name, 18 as age, now() as sync_time
   |""".stripMargin
   )
   ```
   4. make hudi datasource options
   ```
   var hoodieProp = Map(
     "hoodie.table.name" -> "hudi_0_12_1_spark_test",
     "hoodie.datasource.write.operation" -> "upsert",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
     "hoodie.datasource.write.partitionpath.field" -> "",
     "hoodie.datasource.write.precombine.field" -> "sync_time",
     "hoodie.metadata.enable" -> "true",
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.embed.timeline.server" -> "false")
   ```
   5. write data for the first time
   ```
   dfData.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
   ```
   6. query in spark sql
   ```
   spark.sql(
   s"""|select *
       | from (
       |   select 'ori' as flag, a.* from `malawi`.`hudi_0_12_1_spark_test` a
       |   union all
       |   select '_rt' as flag, b.* from `malawi`.`hudi_0_12_1_spark_test_rt` b
       | ) t
       |order by t.id asc, t.flag asc""".stripMargin
   ).show(false)
   ```
   
![image](https://user-images.githubusercontent.com/62231347/204456307-04f29a4e-6d4c-4c97-9429-32ebeb92f0c1.png)
   
   7. write data a second time
   ```
   dfData2.write.format("org.apache.hudi").options(hoodieProp).mode(SaveMode.Append).save("hdfs://xxx/malawi/hudi_0_12_1_spark_test")
   ```
   8. repeat step 6
   the data (id = 4) should be returned by the _rt table query, but it is not
   
![image](https://user-images.githubusercontent.com/62231347/204456452-4cf23bad-af3d-48a5-b2c9-1766e61b34ee.png)
   
   9. refresh table
   ```
   spark.sql("REFRESH TABLE `malawi`.`hudi_0_12_1_spark_test`")
   ```
   10. repeat step 6
   
   
![image](https://user-images.githubusercontent.com/62231347/204464598-e46e508f-6bd8-480d-a9fc-2141f83d455e.png)
   
   
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.1.3
   
   * Hive version : 3.1.1
   
   * Hadoop version : 3.1.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   





[jira] [Updated] (HUDI-4871) Upgrade spark3.3 to 3.3.1

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4871:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Upgrade spark3.3 to 3.3.1
> -
>
> Key: HUDI-4871
> URL: https://issues.apache.org/jira/browse/HUDI-4871
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yuming Wang
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-3249) Performance Improvements

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3249:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5269) Enhancing core user flow tests for spark-sql writes

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5269:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Enhancing core user flow tests for spark-sql writes
> ---
>
> Key: HUDI-5269
> URL: https://issues.apache.org/jira/browse/HUDI-5269
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> We triaged some of the core user flows and it looks like we don't have good 
> coverage for those flows. 
>  
>  # COW and MOR (w/ and w/o metadata enabled)
>  ## Partitioned (BLOOM, SIMPLE, GLOBAL_BLOOM, BUCKET), non-partitioned (GLOBAL_BLOOM).
>  # Immutable data: pure bulk_insert row writing. 
>  # Immutable w/ file sizing: pure inserts. 
>  # Initial bulk ingest followed by updates: bulk_insert followed by upserts. 
>  # Regular inserts + updates combined.





[jira] [Updated] (HUDI-3967) Automatic savepoint in Hudi

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3967:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Automatic savepoint in Hudi
> ---
>
> Key: HUDI-3967
> URL: https://issues.apache.org/jira/browse/HUDI-3967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: table-service
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-4472) Revisit schema handling in HoodieSparkSqlWriter

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4472:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/15, 2022/11/29  (was: 2022/10/04, 
2022/10/18, 2022/11/15)

> Revisit schema handling in HoodieSparkSqlWriter
> ---
>
> Key: HUDI-4472
> URL: https://issues.apache.org/jira/browse/HUDI-4472
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> After many features aimed at bringing more and more sophisticated support for 
> schema evolution were layered into HoodieSparkSqlWriter, it now requires 
> careful attention to reconcile the many flows and make sure that the 
> original invariants still hold.
>  
> One example of the issue was discovered while addressing HUDI-4081 (which was 
> duct-taped in [#6213|https://github.com/apache/hudi/pull/6213/files#] to 
> avoid substantial changes before the release)





[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4921:
-
Sprint: 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29  (was: 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Recently we added the last completed commit as part of clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just take the last 
> completed commit in the active timeline and set the value. 
> This needs fixing. 





[jira] [Updated] (HUDI-5219) Support "CREATE INDEX" for index function through Spark SQL

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5219:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Support "CREATE INDEX" for index function through Spark SQL
> ---
>
> Key: HUDI-5219
> URL: https://issues.apache.org/jira/browse/HUDI-5219
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-3601) Support multi-arch builds in docker setup

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3601:
-
Sprint: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 
2022/11/15, 2022/11/29  (was: 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15)

> Support multi-arch builds in docker setup
> -
>
> Key: HUDI-3601
> URL: https://issues.apache.org/jira/browse/HUDI-3601
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Refer [https://github.com/apache/hudi/issues/4985]
> Essentially, our current docker demo runs for linux/amd64 platform but not 
> for arm64. We should support multi-arch builds in a fully automated manner. 
> Ideally, the setup script would simply accept a parameter:
> {code:java}
> docker/setup_demo.sh --platform linux/arm64
> {code}





[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4967:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{hoodie.datasource.hive_sync.partition_value_extractor}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{MultiPartKeysValueExtractor}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor}}. From 
> this release, if this config is not set and Hive sync is enabled, the 
> partition value extractor class will be *automatically inferred* based on 
> the number of partition fields and whether or not hive style partitioning is 
> enabled.
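If you relied on the previous default, restoring it explicitly is a one-line config; a properties-style sketch (key and class name are the ones given in the release notes above):

```
hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
```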





[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-83:
---
Sprint: Cont' improve -  2021/01/24, Cont' improve -  2021/01/31, 
2022/09/05, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 
Cont' improve -  2021/01/24, Cont' improve -  2021/01/31, 2022/09/05, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync, Usability
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Major
>  Labels: pull-request-available, query-eng, sev:critical, 
> user-support-issues
> Fix For: 0.12.2
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 





[jira] [Updated] (HUDI-5157) Duplicate partition path for chained hudi tables.

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5157:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Duplicate partition path for chained hudi tables. 
> --
>
> Key: HUDI-5157
> URL: https://issues.apache.org/jira/browse/HUDI-5157
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> tblA -> tblB -> tblC. 
> If I chain 3 hudi tables via HoodieIncrSource, then for tblC we encounter a 
> duplicate partition path (meta field) exception. 
> {code:java}
> client token: N/A
>diagnostics: User class threw exception: 
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data 
> schema: `_hoodie_partition_path`;
>   at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:90)
>   at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:70)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:440)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
>   at 
> com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
>   at 
> org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
>   at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
>   at 
> org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:95)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:388)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:728)
>  {code}
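One workaround while the fix lands is to drop Hudi's meta columns from the incrementally read frame before writing it to the next table in the chain. A minimal Spark-free sketch of the column filtering (the helper name is illustrative; the column names are Hudi's standard meta fields):

```python
# Hudi's standard meta columns, added to every record on write.
HOODIE_META_COLS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}


def non_meta_columns(columns):
    """Return the given column names minus Hudi meta fields, keeping order."""
    return [c for c in columns if c not in HOODIE_META_COLS]


# In a PySpark job the filter would be applied before the downstream write,
# e.g. df.select(*non_meta_columns(df.columns)).
```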





[jira] [Updated] (HUDI-5231) Address checkstyle warnings while building hudi

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5231:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Address checkstyle warnings while building hudi
> ---
>
> Key: HUDI-5231
> URL: https://issues.apache.org/jira/browse/HUDI-5231
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dev-experience
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> As of now, we see a lot of checkstyle warnings while building hudi. We need to 
> take a look to see how exactly we can fix this. 
>  * Take a stab at fixing as many as possible.
>  * If we can't get the count down to a manageable number, suppress the 
> checkstyle warnings. 
>  
> excerpt logs
> {code:java}
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java:46:
>  'org.apache.log4j.LogManager' should be separated from previous imports. 
> [ImportOrder]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java:74:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java:271:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java:33:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJsonKafkaSourcePostProcessor.java:22:
>  Import org.apache.hudi.common.config.TypedProperties appears after other 
> imports that it should precede [ImportOrder]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJsonKafkaSourcePostProcessor.java:61:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJsonKafkaSourcePostProcessor.java:315:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCheckpointUtils.java:24:
>  Import 
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.CheckpointUtils 
> appears after other imports that it should precede [ImportOrder]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestS3EventsMetaSelector.java:56:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestProtoConversionUtil.java:47:
>  Import com.google.protobuf.util.Timestamps appears after other imports that 
> it should precede [ImportOrder]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestProtoConversionUtil.java:72:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestDFSPathSelectorCommonMethods.java:46:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCloudObjectsSelector.java:57:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestDatePartitionPathSelector.java:49:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/debezium/TestMysqlDebeziumSource.java:31:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/debezium/TestPostgresDebeziumSource.java:31:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/debezium/TestAbstractDebeziumSource.java:58:
>  Missing a Javadoc comment. [JavadocType]
> [INFO] 
> /Users/nsb/Documents/personal/projects/nov26/hudi/h

[jira] [Updated] (HUDI-5079) Optimize rdd.isEmpty within DeltaSync

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5079:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> Optimize rdd.isEmpty within DeltaSync
> -
>
> Key: HUDI-5079
> URL: https://issues.apache.org/jira/browse/HUDI-5079
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> We are calling rdd.isEmpty for the source rdd twice in DeltaSync. We should 
> try to optimize this and reuse the result. 
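The optimization amounts to evaluating the emptiness check once and reusing the answer; a small Spark-free sketch of the memoization (class and names are illustrative, standing in for the rdd.isEmpty call, which triggers a Spark job each time):

```python
class Batch:
    """Wraps an expensive emptiness check and evaluates it at most once."""

    def __init__(self, compute_is_empty):
        self._compute_is_empty = compute_is_empty
        self._cached = None  # None means "not yet evaluated"

    def is_empty(self):
        # First call pays for the expensive check; later calls reuse it.
        if self._cached is None:
            self._cached = self._compute_is_empty()
        return self._cached
```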





[jira] [Updated] (HUDI-5075) Add support to rollback residual clustering after disabling clustering

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5075:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> Add support to rollback residual clustering after disabling clustering
> --
>
> Key: HUDI-5075
> URL: https://issues.apache.org/jira/browse/HUDI-5075
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> If a user enabled clustering and after some time disabled it for whatever 
> reason, there is a chance that a pending clustering is left in the 
> timeline. Once clustering is disabled, this could just be lying around, 
> but it could affect metadata table compaction, which in turn might affect 
> the data table archival. 
> So, we need a way to fix this. 
>  





[jira] [Updated] (HUDI-5242) Do not fail Meta sync in Deltastreamer when inline table service fails

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5242:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Do not fail Meta sync in Deltastreamer when inline table service fails
> --
>
> Key: HUDI-5242
> URL: https://issues.apache.org/jira/browse/HUDI-5242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering, compaction, deltastreamer
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
>
> When table services fail in deltastreamer, meta sync will not occur. This can 
> cause missing partitions in the metaserver because, even though the data is 
> written, the new partitions will never be synced to the metaserver. 





[jira] [Updated] (HUDI-5101) Adding spark structured streaming tests to integ tests

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5101:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> Adding spark structured streaming tests to integ tests
> --
>
> Key: HUDI-5101
> URL: https://issues.apache.org/jira/browse/HUDI-5101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>






[jira] [Updated] (HUDI-5210) End-to-end PoC of index function

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5210:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> End-to-end PoC of index function
> 
>
> Key: HUDI-5210
> URL: https://issues.apache.org/jira/browse/HUDI-5210
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-5078) When applying changes to MDT, any replace commit is considered a table service

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5078:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> When applying changes to MDT, any replace commit is considered a table service
> --
>
> Key: HUDI-5078
> URL: https://issues.apache.org/jira/browse/HUDI-5078
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> Table services in the metadata table can only be invoked by non-table-service 
> operations from the data table. In other words, compaction or clustering on 
> the data table cannot trigger compaction in the MDT. 
> But we mistakenly considered any replace commit a table service. 
>  





[jira] [Updated] (HUDI-5080) UnpersistRdds unpersist all rdds in the spark context

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5080:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> UnpersistRdds unpersist all rdds in the spark context
> -
>
> Key: HUDI-5080
> URL: https://issues.apache.org/jira/browse/HUDI-5080
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> In SparkRDDWriteClient, we have a method to clean up persisted Rdds to free 
> up the space occupied. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java#L584]
> But the issue is, it cleans up all persisted rdds in the given spark context. 
> This will impact async compaction or any other async table services running; 
> and even if there are multiple streams writing to different tables, this will 
> cause a huge impact. 
>  
> This also needs to be fixed with DeltaSync. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L345]
>  
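The fix direction is bookkeeping: each write client remembers the ids of the RDDs it persisted and unpersists only those, rather than clearing everything in the shared context. A Spark-free sketch of that ownership tracking (all names are illustrative; the registry stands in for the context's set of persisted RDDs, sc.getPersistentRDDs in Spark):

```python
class CacheRegistry:
    """Stands in for the shared set of persisted RDDs in a Spark context."""

    def __init__(self):
        self._entries = {}

    def persist(self, rdd_id, payload):
        self._entries[rdd_id] = payload

    def unpersist(self, rdd_id):
        self._entries.pop(rdd_id, None)

    def ids(self):
        return set(self._entries)


class WriteClient:
    """Remembers which entries it persisted and releases only those,
    leaving other writers' cached data alone."""

    def __init__(self, registry):
        self._registry = registry
        self._own_ids = set()

    def persist(self, rdd_id, payload):
        self._registry.persist(rdd_id, payload)
        self._own_ids.add(rdd_id)

    def release_own(self):
        # Unpersist only what this client cached, not the whole registry.
        for rdd_id in self._own_ids:
            self._registry.unpersist(rdd_id)
        self._own_ids.clear()
```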





[jira] [Updated] (HUDI-5148) Write RFC for index function

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5148:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Write RFC for index function
> 
>
> Key: HUDI-5148
> URL: https://issues.apache.org/jira/browse/HUDI-5148
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5109) Source all metadata table instability issues

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5109:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Source all metadata table instability issues
> 
>
> Key: HUDI-5109
> URL: https://issues.apache.org/jira/browse/HUDI-5109
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.2
>
>
> Let's collect all issues related to metadata table stability, e.g. 
> out-of-sync issues with the data table, missing or corrupt data, or the data 
> table not making progress due to nuances around the metadata table. 
>  
> Let's comb through Slack, GitHub issues, and Jiras and document them here. 
> We may need to harden the metadata table and close out all of the gaps. 
>  





[jira] [Updated] (HUDI-4411) Bump Spark version to 3.2.2

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4411:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Bump Spark version to 3.2.2
> ---
>
> Key: HUDI-4411
> URL: https://issues.apache.org/jira/browse/HUDI-4411
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Luning Wang
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> [https://lists.apache.org/list?d...@spark.apache.org:lte=1M:CVE-2022-33891]
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-33891





[jira] [Updated] (HUDI-5023) Evaluate removing Queueing in the write path

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5023:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Evaluate removing Queueing in the write path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (like other engines) relies on the notion of _iteration_ to vertically 
> compose all transformations applied to a single record, allowing for 
> effective _stream_ processing where all transformations are applied to an 
> _Iterator yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> effectively limiting the amount of memory used to the number of records being 
> read and processed simultaneously (if the reading is not batched, that's just 
> a single record), which in turn allows us
>  # To limit the number of memory allocations required to process a single 
> record. Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both GC-overhead and 
> object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As called out in the previous paragraph, enqueueing raw data read from the 
> source breaks up the _stream_ processing paradigm and forces records to be 
> persisted on the heap.
> Consider the following example: the plain ParquetReader in Spark actually 
> uses a *mutable* `ColumnarBatchRow` providing a row-based view into the batch 
> of data being read from the file.
> Now, since it's a mutable object, we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (writing into another file, a shuffle block, etc), but we +can't keep a 
> reference to it+ (for example, by +enqueueing+ it) since the object is 
> mutable. Instead we are forced to make a *copy* of it, which obviously 
> requires us to allocate it on the heap.





[jira] [Updated] (HUDI-5213) Support index function for Spark SQL built-in functions

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5213:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Support index function for Spark SQL built-in functions 
> 
>
> Key: HUDI-5213
> URL: https://issues.apache.org/jira/browse/HUDI-5213
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5238:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root cause:
> Basically, the reason it's failing is the following:
>  # GCS uses PipeInputStream/PipeOutputStream comprising the reading/writing 
> ends of the "pipe" it uses for unidirectional communication b/w threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we bootstrap new executors (read: threads) for 
> reading and _writing_ (it's only used in HoodieMergeHandle and in 
> bulk-insert)
>  # When we're done writing in HoodieMergeHelper, we shut down the BIMQ 
> *first* and then the HoodieMergeHandle, and that's exactly why it's failing
>  
> The issue was introduced in [https://github.com/apache/hudi/pull/4264/files]
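The piped-stream behavior behind points 2 and 4 can be demonstrated standalone with plain java.io (this is an analogue, not GCS or Hudi code): once the thread that wrote into the pipe dies, further reads fail even though nothing was logically wrong.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Minimal demo of java.io piped-stream semantics: PipedInputStream remembers
// the last writer thread, and once that thread dies, reading past the buffered
// data throws an IOException, analogous to shutting down the queue's writer
// threads before the merge handle is done with the stream.
public class PipeBrokenDemo {

    // Writes one byte from a separate thread, lets that thread die, then reads.
    static String readAfterWriterDies() {
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);

            // Writer runs on its own thread, like BIMQ's separate executors.
            Thread writer = new Thread(() -> {
                try {
                    out.write(42);  // intentionally not closing the stream
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();
            writer.join();  // the writer thread is now dead

            int buffered = in.read();  // already-buffered data is still readable
            try {
                in.read();  // pipe empty + writer thread dead => IOException
                return "no exception";
            } catch (IOException expected) {
                return "buffered=" + buffered + ", then: " + expected.getMessage();
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // e.g. prints the buffered byte followed by the pipe failure message
        System.out.println(readAfterWriterDies());
    }
}
```

This matches the reported failure mode: the data already in flight is fine, but any subsequent pipe operation after the remembered writer thread has died fails.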





[jira] [Updated] (HUDI-5209) Fix new Disruptor-based `QueueingExecutor` implementation

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5209:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Fix new Disruptor-based `QueueingExecutor` implementation
> -
>
> Key: HUDI-5209
> URL: https://issues.apache.org/jira/browse/HUDI-5209
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In [https://github.com/apache/hudi/pull/5416] we merged a new 
> `QueueingExecutor` implementation based on the LMAX Disruptor.
> However, we missed a few things that need to be addressed:
>  * The Disruptor was not added to our bundles
>  * The new executor was not wired into `HoodieMergeHelper`





[jira] [Updated] (HUDI-4911) Make sure LogRecordReader doesn't flush the cache before each lookup

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4911:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Make sure LogRecordReader doesn't flush the cache before each lookup
> 
>
> Key: HUDI-4911
> URL: https://issues.apache.org/jira/browse/HUDI-4911
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently {{HoodieMetadataMergedLogRecordReader}} will flush the internal 
> record cache before each lookup, which makes every lookup essentially 
> re-process the whole log-block stack again.
> We should avoid that and only do the re-parsing incrementally (for the keys 
> that aren't already cached)
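The incremental approach can be sketched roughly as follows (hypothetical names, not Hudi's actual API): keep the cache across lookups and scan the log blocks only for the missing keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed fix: instead of flushing the record
// cache before each lookup and re-reading the whole log-block stack, re-parse
// only the keys that are not cached yet.
public class IncrementalLookupSketch {
    private final Map<String, String> cache = new HashMap<>();
    int logScans = 0;  // counts how often we hit the (expensive) log blocks

    // Stand-in for scanning the log-block stack for a batch of keys.
    private Map<String, String> scanLogBlocks(List<String> keys) {
        logScans++;
        Map<String, String> result = new HashMap<>();
        for (String k : keys) {
            result.put(k, "record-for-" + k);
        }
        return result;
    }

    public Map<String, String> lookup(List<String> keys) {
        List<String> misses = new ArrayList<>();
        for (String k : keys) {
            if (!cache.containsKey(k)) {
                misses.add(k);  // only uncached keys trigger re-parsing
            }
        }
        if (!misses.isEmpty()) {
            cache.putAll(scanLogBlocks(misses));
        }
        Map<String, String> result = new HashMap<>();
        for (String k : keys) {
            result.put(k, cache.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        IncrementalLookupSketch reader = new IncrementalLookupSketch();
        reader.lookup(List.of("a", "b"));
        reader.lookup(List.of("a", "b"));  // fully cached: no extra scan
        reader.lookup(List.of("a", "c"));  // only "c" is re-parsed
        System.out.println(reader.logScans);  // 2
    }
}
```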





[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4937:
-
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This has already proven wasteful on a number of occasions, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373





[jira] [Updated] (HUDI-5138) [Reader] Implement FileSliceReader (FSR) APIs

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5138:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> [Reader] Implement FileSliceReader (FSR) APIs
> -
>
> Key: HUDI-5138
> URL: https://issues.apache.org/jira/browse/HUDI-5138
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-2740) Support for snapshot querying on MOR table

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2740:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Support for snapshot querying on MOR table
> --
>
> Key: HUDI-2740
> URL: https://issues.apache.org/jira/browse/HUDI-2740
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5141) [Reader] Integrate metadata files and column_stats partitions in FileIndex

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5141:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> [Reader] Integrate metadata files and column_stats partitions in FileIndex
> --
>
> Key: HUDI-5141
> URL: https://issues.apache.org/jira/browse/HUDI-5141
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Alexey Kudinkin
>Priority: Major
>






[jira] [Updated] (HUDI-5211) Add abstraction to track a function defined on a column

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5211:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Add abstraction to track a function defined on a column
> ---
>
> Key: HUDI-5211
> URL: https://issues.apache.org/jira/browse/HUDI-5211
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal for query engine integrations

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4142:
-
Sprint: 2022/05/16, 2022/05/31, 2022/11/01, 2022/11/15, 2022/11/29  (was: 
2022/05/16, 2022/05/31, 2022/11/01, 2022/11/15)

> RFC for new Table APIs proposal for query engine integrations
> -
>
> Key: HUDI-4142
> URL: https://issues.apache.org/jira/browse/HUDI-4142
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Document all APIs.





[jira] [Updated] (HUDI-5212) Store index function in table properties

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5212:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Store index function in table properties
> 
>
> Key: HUDI-5212
> URL: https://issues.apache.org/jira/browse/HUDI-5212
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-5214) Add functionality to create new MT partition for index function

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5214:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Add functionality to create new MT partition for index function
> ---
>
> Key: HUDI-5214
> URL: https://issues.apache.org/jira/browse/HUDI-5214
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-4588) Ingestion failing if source column is dropped

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4588:
-
Sprint: 2022/08/08, 2022/08/22, 2022/09/05, 2022/10/04, 2022/10/18, 
2022/11/15, 2022/11/29  (was: 2022/08/08, 2022/08/22, 2022/09/05, 2022/10/04, 
2022/10/18, 2022/11/15)

> Ingestion failing if source column is dropped
> -
>
> Key: HUDI-4588
> URL: https://issues.apache.org/jira/browse/HUDI-4588
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Vamshi Gudavarthi
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available, schema, schema-evolution
> Fix For: 0.13.0
>
> Attachments: schema_stage1.avsc, schema_stage2.avsc, stage_1.json, 
> stage_2.json
>
>
> Ingestion using Deltastreamer fails if columns are dropped from the source. I 
> reproduced this using the docker-demo setup. Below are the steps to reproduce 
> it:
>  # Create the data file `stage_1.json` (attached), ingest it to Kafka, and 
> ingest it from Kafka into a Hudi table using a Deltastreamer job (using 
> FilebasedSchemaProvider with `schema_stage1.avsc`).
>  # Simulate dropping a column from the source in the next step.
>  # Repeat step 1 with the stage-2 files. The stage-2 files don't have the 
> `day` column, and the ingestion job fails. Below is the detailed stacktrace.
> {code:java}
> Driver stacktrace:
>     at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>     at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
>     at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:607)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:201)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:199)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
>

[jira] [Updated] (HUDI-5137) [Reader] Push down filters in FSR

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5137:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> [Reader] Push down filters in FSR
> -
>
> Key: HUDI-5137
> URL: https://issues.apache.org/jira/browse/HUDI-5137
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-5136) [Reader] Project schemas in FileSliceReader (FSR)

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5136:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> [Reader] Project schemas in FileSliceReader (FSR) 
> --
>
> Key: HUDI-5136
> URL: https://issues.apache.org/jira/browse/HUDI-5136
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-3529) Improve dependency management and bundling

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3529:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Improve dependency management and bundling
> --
>
> Key: HUDI-3529
> URL: https://issues.apache.org/jira/browse/HUDI-3529
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5216) Implement file-level stats update for index function

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5216:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Implement file-level stats update for index function
> 
>
> Key: HUDI-5216
> URL: https://issues.apache.org/jira/browse/HUDI-5216
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1574:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19, 2022/10/04, 2022/10/18, 
2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/08/22, 2022/09/05, 2022/09/19, 
2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Critical
> Fix For: 0.13.0
>
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure





[jira] [Updated] (HUDI-5051) Add a functional regression test for Bloom Index followed on w/ Upserts

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5051:
-
Sprint: 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/10/18, 
2022/11/01, 2022/11/15)

> Add a functional regression test for Bloom Index followed on w/ Upserts
> ---
>
> Key: HUDI-5051
> URL: https://issues.apache.org/jira/browse/HUDI-5051
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.2
>
>
> In the test
>  * State is initially bootstrapped by Bulk Insert (row-writing)
>  * Follow-up w/ upserts





[jira] [Updated] (HUDI-4990) Parallelize deduplication in CLI tool

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4990:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Parallelize deduplication in CLI tool
> -
>
> Key: HUDI-4990
> URL: https://issues.apache.org/jira/browse/HUDI-4990
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.12.2
>
>
> The CLI tool command `repair deduplicate` repairs one partition at a time. 
> Repairing hundreds of partitions this way takes time. We should add a mode 
> that takes multiple partition paths and runs the dedup job for multiple 
> partitions at the same time.





[jira] [Updated] (HUDI-5134) Implement PartitionSnapshot and PartitionDescriptor

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5134:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Implement PartitionSnapshot and PartitionDescriptor
> ---
>
> Key: HUDI-5134
> URL: https://issues.apache.org/jira/browse/HUDI-5134
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>






[jira] [Updated] (HUDI-5135) Abstract out generic FileIndex

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5135:
-
Sprint: 2022/11/01, 2022/11/15, 2022/11/29  (was: 2022/11/01, 2022/11/15)

> Abstract out generic FileIndex 
> ---
>
> Key: HUDI-5135
> URL: https://issues.apache.org/jira/browse/HUDI-5135
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> FileIndex has to provide its full suite of capabilities in a generic way:
>  * Partition Pruning
>  * Column Stats Pruning
>  * Caching
>  * etc
>  
> To support partition pruning as well as col-stats pruning in an 
> engine-agnostic way, we'd have to implement our own Expression hierarchy 
> supporting:
>  * Conversion from each engine-specific hierarchy to Hudi's
>  * The ability to evaluate expressions
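A minimal sketch of such an engine-agnostic hierarchy (hypothetical names, not Hudi's design): nodes that an engine-specific filter can be converted into, and that can also be evaluated directly, e.g. against a candidate partition's values for pruning.

```java
import java.util.Map;

// Rough sketch of an engine-agnostic Expression hierarchy: convertible nodes
// that can also be evaluated directly against a row of column values.
public class ExpressionSketch {

    interface Expression {
        Object eval(Map<String, Object> row);
    }

    static class ColumnRef implements Expression {
        final String name;
        ColumnRef(String name) { this.name = name; }
        public Object eval(Map<String, Object> row) { return row.get(name); }
    }

    static class Literal implements Expression {
        final Object value;
        Literal(Object value) { this.value = value; }
        public Object eval(Map<String, Object> row) { return value; }
    }

    static class GreaterThan implements Expression {
        final Expression left, right;
        GreaterThan(Expression left, Expression right) {
            this.left = left;
            this.right = right;
        }
        @SuppressWarnings("unchecked")
        public Object eval(Map<String, Object> row) {
            Comparable<Object> l = (Comparable<Object>) left.eval(row);
            return l.compareTo(right.eval(row)) > 0;
        }
    }

    public static void main(String[] args) {
        // e.g. a Spark filter `ds_date > '2022-11-01'` converted into this
        // hierarchy, then evaluated against a candidate partition for pruning.
        Expression predicate =
            new GreaterThan(new ColumnRef("ds_date"), new Literal("2022-11-01"));
        System.out.println(predicate.eval(Map.of("ds_date", "2022-11-15")));  // true
    }
}
```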





[jira] [Updated] (HUDI-5215) Support file pruning based on new index function in Spark

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5215:
-
Sprint: 2022/11/15, 2022/11/29  (was: 2022/11/15)

> Support file pruning based on new index function in Spark
> -
>
> Key: HUDI-5215
> URL: https://issues.apache.org/jira/browse/HUDI-5215
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Created] (HUDI-5291) NPE in column stats for null values

2022-11-28 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-5291:
-

 Summary: NPE in column stats for null values
 Key: HUDI-5291
 URL: https://issues.apache.org/jira/browse/HUDI-5291
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Sagar Sumit
 Fix For: 0.12.2


[https://github.com/apache/hudi/issues/6936]



[This 
code|https://github.com/apache/hudi/blob/0d70df89fe6b7049d576e2b9bf75afb29c75c46d/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L147]
 can throw an NPE from Avro utils while processing null values. 





[GitHub] [hudi] codope commented on issue #6936: [SUPPORT] NPE when trying to upsert with option hoodie.metadata.index.column.stats.enable : true.

2022-11-28 Thread GitBox


codope commented on issue #6936:
URL: https://github.com/apache/hudi/issues/6936#issuecomment-1330186251

   I think the issue is due to a timestamp-type field with a null value. The 
reason it is not reproducible during the first insert is that the records go 
through `HoodieCreateHandle`, which does not merge column stats on the first 
insert. Upon a subsequent upsert, records go through `HoodieAppendHandle`, 
which attempts to merge column stats but then fails for the timestamp type if 
the value is null. See the script below to repro:
   ```
   >>> from pyspark.sql.types import StructType, StructField, TimestampType, StringType
   >>> schema = StructType([
   ...   StructField('TimePeriod', StringType(), True),
   ...   StructField('StartTimeStamp', TimestampType(), True),
   ...   StructField('EndTimeStamp', TimestampType(), True)
   ... ])
   >>> import time
   >>> import datetime
   >>> timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
   >>> timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
   >>> columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2)]
   >>> df2 = spark.createDataFrame(data, schema)
   >>> df2.printSchema()
   root
    |-- TimePeriod: string (nullable = true)
    |-- StartTimeStamp: timestamp (nullable = true)
    |-- EndTimeStamp: timestamp (nullable = true)

   >>> hudi_write_options_no_partition = {
   ...   'hoodie.table.name': tableName,
   ...   'hoodie.datasource.write.recordkey.field': 'TimePeriod',
   ...   'hoodie.datasource.write.table.name': tableName,
   ...   'hoodie.datasource.write.precombine.field': 'EndTimeStamp',
   ...   'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
   ...   'hoodie.metadata.enable': 'true',
   ...   'hoodie.metadata.index.bloom.filter.enable': 'true',
   ...   'hoodie.metadata.index.column.stats.enable': 'true'
   ... }
   >>> df2.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("overwrite").save(basePath)
   22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   22/11/29 07:00:33 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   22/11/29 07:00:34 WARN metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
   [Stage 7:>  (0 + 1) / 1]

   # an update with a non-null timestamp will succeed
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, datetime.datetime.strptime('19:59:59:59',"%H:%M:%S:%f"))]
   >>> updateDF = spark.createDataFrame(data, schema)
   >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)

   # an update with a null timestamp will throw an exception
   >>> data = [("16:00:00:00 -> 18:59:59:59", timestamp, None)]
   >>> updateDF = spark.createDataFrame(data, schema)
   >>> updateDF.write.format("org.apache.hudi").options(**hudi_write_options_no_partition).mode("append").save(basePath)
   ```
   
   I would suggest cleaning the data if possible: replace nulls in the dataframe with the oldest Unix timestamp or some default value suitable for your use case. Ideally, though, this should be handled in `HoodieTableMetadataUtil#collectColumnRangeMetadata`.
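   As a hedged sketch of that idea in plain Python (not Hudi's actual implementation — `merge_range` is a hypothetical helper), a null-safe column-range merge would simply skip missing bounds instead of comparing them:

   ```python
   def merge_range(range_a, range_b):
       """Merge two (min, max) column ranges, ignoring None (all-null) bounds."""
       lows = [v for v in (range_a[0], range_b[0]) if v is not None]
       highs = [v for v in (range_a[1], range_b[1]) if v is not None]
       return (min(lows) if lows else None, max(highs) if highs else None)

   # A file whose timestamp values are all null contributes (None, None)
   # and no longer breaks the merged statistics.
   print(merge_range((1, 5), (None, None)))      # (1, 5)
   print(merge_range((None, None), (None, None)))  # (None, None)
   ```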


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] trushev commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-28 Thread GitBox


trushev commented on PR #5830:
URL: https://github.com/apache/hudi/pull/5830#issuecomment-1330175541

   It looks like Azure doesn't run on this PR anymore. A verifying PR has been opened: https://github.com/apache/hudi/pull/7321
   





[GitHub] [hudi] trushev opened a new pull request, #7321: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-28 Thread GitBox


trushev opened a new pull request, #7321:
URL: https://github.com/apache/hudi/pull/7321

   ### Change Logs
   
   Azure doesn't run on the original PR https://github.com/apache/hudi/pull/5830
   
   This PR adds support for reading with Flink when comprehensive schema evolution (RFC-33) is enabled and the operations *add column*, *rename column*, *change column type*, and *drop column* have been applied.

   ### Impact
   
   User-facing feature change: comprehensive schema evolution in Flink.
   
   ### Risk level medium
   
   This change added tests and can be verified as follows:
 - Added unit test `TestCastMap` to verify that type conversion is correct
 - Added integration test `ITTestSchemaEvolution` to verify that a table with added, renamed, cast, and dropped columns is read as expected.
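   The conversion behaviour that `TestCastMap` exercises can be sketched in plain Python (a hypothetical analogue, not the Flink code): a lookup from (source type, target type) pairs to conversion functions applied while reading evolved columns.

   ```python
   # Hypothetical analogue of a cast map: (source, target) -> conversion function.
   CASTS = {
       ("int", "long"): int,
       ("int", "string"): str,
       ("float", "double"): float,
       ("string", "double"): float,
   }

   def cast_value(value, src, dst):
       """Cast a column value read with the old type to the evolved type."""
       if value is None or src == dst:
           return value
       try:
           return CASTS[(src, dst)](value)
       except KeyError:
           raise TypeError(f"unsupported cast {src} -> {dst}") from None

   print(cast_value(42, "int", "string"))        # '42'
   print(cast_value("3.5", "string", "double"))  # 3.5
   ```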
   
   ### Documentation Update
   There is a schema evolution doc: [https://hudi.apache.org/docs/schema_evolution](https://hudi.apache.org/docs/schema_evolution)
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Closed] (HUDI-5289) WriteStatus RDD is recalculated in cluster

2022-11-28 Thread zouxxyy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zouxxyy closed HUDI-5289.
-
Resolution: Fixed

> WriteStatus RDD is recalculated in cluster
> --
>
> Key: HUDI-5289
> URL: https://issues.apache.org/jira/browse/HUDI-5289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: zouxxyy
>Priority: Major
> Attachments: image-2022-11-29-10-24-08-853.png, 
> image-2022-11-29-10-25-29-546.png, image-2022-11-29-10-26-22-050.png
>
>
> Step:
> {code:java}
> spark-submit \
> --class org.apache.hudi.utilities.HoodieClusteringJob \
> --conf spark.driver.memory=40G \
> --conf spark.executor.instances=20 \
> --conf spark.executor.memory=40G \
> --conf spark.executor.cores=4 \
> hudi-utilities-bundle_2.11-0.12.0.jar \
> --props clusteringjob.properties \
> --mode scheduleAndExecute \
> --base-path xxx \
> --table-name xxx \
> --spark-memory 40g {code}
> The following two stages of the job both relate to the calculation of 
> WriteStatus, but some tasks in stage 96 were recomputed, taking more than 
> ten minutes.
> !image-2022-11-29-10-24-08-853.png|width=1560,height=57!
> here is stage 65
> !image-2022-11-29-10-25-29-546.png|width=640,height=515!
> here is stage 96
> !image-2022-11-29-10-26-22-050.png|width=643,height=435!





[GitHub] [hudi] Zouxxyy commented on pull request #6561: [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering

2022-11-28 Thread GitBox


Zouxxyy commented on PR #6561:
URL: https://github.com/apache/hudi/pull/6561#issuecomment-1330161724

   I also ran into this problem, and I have a solution as well.
   
   My solution is to use `writeMetadata.getWriteStats()` in `validateWriteResult` instead of `writeMetadata.getWriteStatuses()`, because `getWriteStats` does not trigger RDD recomputation.
   
   https://user-images.githubusercontent.com/37108074/204458340-23cc5ef5-48b0-4fb1-86f7-b1a208be9986.png
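   The difference can be sketched in plain Python (an analogy only, not Spark): an unpersisted RDD behaves like a lazy pipeline that re-runs on every traversal, while write stats are plain materialized objects that are simply read back.

   ```python
   # Analogy only (not Spark): count how often the lazy "pipeline" is re-run.
   compute_calls = 0

   def get_write_statuses():
       """Lazy RDD-like pipeline: every consumer triggers a recomputation."""
       global compute_calls
       compute_calls += 1
       return (record * 2 for record in range(3))

   write_stats = [0, 2, 4]  # materialized metadata, computed once at commit time

   list(get_write_statuses())
   list(get_write_statuses())
   print(compute_calls)  # 2 -> the pipeline ran twice
   print(write_stats)    # read back with no recomputation
   ```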
   





[jira] [Updated] (HUDI-4847) hive sync fails w/ utilities bundle in 0.13-snapshot, but succeeds w/ 0.11

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4847:
-
Sprint: 2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/12/06  (was: 
2022/09/19, 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/29)

> hive sync fails w/ utilities bundle in 0.13-snapshot, but succeeds w/ 0.11
> --
>
> Key: HUDI-4847
> URL: https://issues.apache.org/jira/browse/HUDI-4847
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
>
> I used to run HiveSyncTool directly as a Spark job to trigger Hive sync, and 
> I use the hudi-utilities bundle for it. This used to work with the 0.11 Hudi 
> jars, but it fails with 0.13-SNAPSHOT. I use HMS-based sync.
>  
> sample job configs used:
> {code:java}
> arguments:
>   - --database
>   - test_db
>   - --table
>   - catalog_sales
>   - --base-path
>   - [TABLE_PATH]
>   - --user
>   - dbuser
>   - --pass
>   - dbuser
>   - --metastore-uris
>   - 'thrift://metastore123:9083'
>   - --sync-mode
>   - hms
>   - --partitioned-by
>   - cs_sold_date_sk
>   - --partition-value-extractor
>   - org.apache.hudi.hive.MultiPartKeysValueExtractor {code}
> The exact same configs work with the 0.11 utilities bundle. 
>  
>  





[jira] [Updated] (HUDI-4847) hive sync fails w/ utilities bundle in 0.13-snapshot, but succeeds w/ 0.11

2022-11-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4847:
-
Fix Version/s: (was: 0.12.2)

> hive sync fails w/ utilities bundle in 0.13-snapshot, but succeeds w/ 0.11
> --
>
> Key: HUDI-4847
> URL: https://issues.apache.org/jira/browse/HUDI-4847
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
>
> I used to run HiveSyncTool directly as a Spark job to trigger Hive sync, and 
> I use the hudi-utilities bundle for it. This used to work with the 0.11 Hudi 
> jars, but it fails with 0.13-SNAPSHOT. I use HMS-based sync.
>  
> sample job configs used:
> {code:java}
> arguments:
>   - --database
>   - test_db
>   - --table
>   - catalog_sales
>   - --base-path
>   - [TABLE_PATH]
>   - --user
>   - dbuser
>   - --pass
>   - dbuser
>   - --metastore-uris
>   - 'thrift://metastore123:9083'
>   - --sync-mode
>   - hms
>   - --partitioned-by
>   - cs_sold_date_sk
>   - --partition-value-extractor
>   - org.apache.hudi.hive.MultiPartKeysValueExtractor {code}
> The exact same configs work with the 0.11 utilities bundle. 
>  
>  





[GitHub] [hudi] codope commented on issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean

2022-11-28 Thread GitBox


codope commented on issue #7209:
URL: https://github.com/apache/hudi/issues/7209#issuecomment-1330150927

   Also, to unblock the pipeline, I would suggest disabling the async cleaner temporarily. But I would really love to see what was going on around commit `20221115122551524`. Why wouldn't the commit metadata get deserialized?
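   A minimal sketch of that workaround (the keys are standard Hudi write configs, but verify against the configuration docs for your version):

   ```python
   # Temporarily run cleaning inline instead of asynchronously to unblock
   # the pipeline while the bad commit is investigated.
   hudi_options = {
       "hoodie.clean.automatic": "true",  # keep cleaning itself enabled
       "hoodie.clean.async": "false",     # but run it synchronously for now
   }

   print(hudi_options["hoodie.clean.async"])  # false
   ```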





[GitHub] [hudi] hudi-bot commented on pull request #6915: [HUDI-5004] Allow nested field as primary key and preCombineField in flink sql

2022-11-28 Thread GitBox


hudi-bot commented on PR #6915:
URL: https://github.com/apache/hudi/pull/6915#issuecomment-1330150865

   
   ## CI report:
   
   * 063adc8eb9514f64747f98b3cd1f28ddd30f430f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12120)
 
   * 7727723e5b09a86d950aaa20b1ac77187680e1f0 UNKNOWN
   * bc1999e36789fe7273456a3fac8b7ae067778b45 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13307)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] just-JL commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


just-JL commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330150472

   > Thanks for the contribution, can you rebase with the latest master and resolve the compile error, can you also add a UT for the guard lock for metadata table in `TestStreamWriteOperatorCoordinator`.
   
   OK, I will update it later.





[GitHub] [hudi] codope commented on issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean

2022-11-28 Thread GitBox


codope commented on issue #7209:
URL: https://github.com/apache/hudi/issues/7209#issuecomment-1330149146

   I inspected the last clean instant in the above timeline but did not hit the same issue.
   ```
   scala> import org.apache.hudi.common.table.HoodieTableMetaClient
   import org.apache.hudi.common.table.HoodieTableMetaClient
   
   scala> import org.apache.hudi.common.table.timeline.TimelineMetadataUtils
   import org.apache.hudi.common.table.timeline.TimelineMetadataUtils
   
   scala> import org.apache.hudi.avro.model.HoodieCleanMetadata
   import org.apache.hudi.avro.model.HoodieCleanMetadata
   
   scala> val metaClient = HoodieTableMetaClient.builder().setConf(sc.hadoopConfiguration).setBasePath("file:///opt/issue-7209").build()
   metaClient: org.apache.hudi.common.table.HoodieTableMetaClient = HoodieTableMetaClient{basePath='file:/opt/issue-7209', metaPath='file:/opt/issue-7209/.hoodie', tableType=MERGE_ON_READ}
   
   scala> val lastCleanInstant = metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants().lastInstant()
   lastCleanInstant: org.apache.hudi.common.util.Option[org.apache.hudi.common.table.timeline.HoodieInstant] = Option{val=[20221110205549920__clean__COMPLETED]}
   
   scala> val cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(metaClient.getActiveTimeline().getInstantDetails(lastCleanInstant.get()).get())
   cleanMetadata: org.apache.hudi.avro.model.HoodieCleanMetadata = 
{"startCleanTime": "20221110205549920", "timeTakenInMillis": 2354, 
"totalFilesDeleted": 2668, "earliestCommitToRetain": "20221110201240334", 
"lastCompletedCommitTimestamp": "", "partitionMetadata": 
{"year=2022/month=10/week=41/day=11/hour=06/app=mapy/os=android": 
{"partitionPath": 
"year=2022/month=10/week=41/day=11/hour=06/app=mapy/os=android", "policy": 
"KEEP_LATEST_COMMITS", "deletePathPatterns": 
["e4eff7fe-2c45-4a54-a64a-447f05c5c271-0_393-184-83092_20221110194231092.parquet"],
 "successDeleteFiles": 
["e4eff7fe-2c45-4a54-a64a-447f05c5c271-0_393-184-83092_20221110194231092.parquet"],
 "failedDeleteFiles": [], "isPartitionDeleted": false}, 
"year=2022/month=11/week=44/day=05/hour=18/app=sbrowser/os=android": 
{"partitionPath":...
   scala>
   ```
   
   It looks like the offending commit `20221115122551524` is not present in the attached timeline.
   Just before the error, we see the following logs:
   ```
   2022-11-15 12:32:28,321 INFO fs.FSUtils: Removed directory at 
/hits/app/hudi_cileni/.hoodie/.temp/20221115122551524
   2022-11-15 12:32:28,321 INFO client.BaseHoodieWriteClient: Async cleaner has 
been spawned. Waiting for it to finish
   2022-11-15 12:32:28,321 INFO async.AsyncCleanerService: Waiting for async 
clean service to finish
   ```
   
   We need to inspect this commit. @koldic Do you have the above commit info?
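   The selection logic used in the inspection above can be sketched in plain Python (an illustration only, not the Hudi API):

   ```python
   # Each instant: (timestamp, action, state), ordered by timestamp.
   timeline = [
       ("20221110194231092", "commit", "COMPLETED"),
       ("20221110205549920", "clean", "COMPLETED"),
       ("20221115122551524", "commit", "INFLIGHT"),
   ]

   def last_completed_instant(timeline, action):
       """Mirror of filterCompletedInstants().lastInstant() for one action type."""
       completed = [t for t in timeline if t[1] == action and t[2] == "COMPLETED"]
       return completed[-1] if completed else None

   print(last_completed_instant(timeline, "clean")[0])  # 20221110205549920
   ```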





[GitHub] [hudi] hudi-bot commented on pull request #7304: [HUDI-5278]support more conf to cluster procedure

2022-11-28 Thread GitBox


hudi-bot commented on PR #7304:
URL: https://github.com/apache/hudi/pull/7304#issuecomment-1330147640

   
   ## CI report:
   
   * 728191c9f32587415239c87cf89ce3efb6b42fb0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13301)
 
   * 20a511d8ed5d22d8dd6fda83399b45f8e947d9c7 UNKNOWN
   * 3f8f9a48fadb41bf0c9e6449be942035e7b2d4d7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13306)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7264: [HUDI-5253] HoodieMergeOnReadTableInputFormat could have duplicate records issue if it contains delta files while still splittable

2022-11-28 Thread GitBox


hudi-bot commented on PR #7264:
URL: https://github.com/apache/hudi/pull/7264#issuecomment-1330147534

   
   ## CI report:
   
   * 1b07f271c16acd022d7853df55bb0a838b8b30de Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13177)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13205)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13213)
 
   * d6167ff53f72e8b630dd794c9d1568026e58fba1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13308)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7196: [HUDI-5279] move logic for deleting active instant to HoodieActiveTimeline

2022-11-28 Thread GitBox


hudi-bot commented on PR #7196:
URL: https://github.com/apache/hudi/pull/7196#issuecomment-1330147406

   
   ## CI report:
   
   * 3b1e3982daaa1118a96a63b54fe70bbc922eb4e4 UNKNOWN
   * 43684b3d65f627d92b1bfa61699e7d61983fe95a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13279)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13295)
 
   * a7d0518c591b55bfb318bf505b62ba8d8737ecfe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13305)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6915: [HUDI-5004] Allow nested field as primary key and preCombineField in flink sql

2022-11-28 Thread GitBox


hudi-bot commented on PR #6915:
URL: https://github.com/apache/hudi/pull/6915#issuecomment-1330147096

   
   ## CI report:
   
   * 063adc8eb9514f64747f98b3cd1f28ddd30f430f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12120)
 
   * 7727723e5b09a86d950aaa20b1ac77187680e1f0 UNKNOWN
   * bc1999e36789fe7273456a3fac8b7ae067778b45 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] voonhous commented on a diff in pull request #6915: [HUDI-5004] Allow nested field as primary key and preCombineField in flink sql

2022-11-28 Thread GitBox


voonhous commented on code in PR #6915:
URL: https://github.com/apache/hudi/pull/6915#discussion_r1034344167


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##
@@ -142,7 +143,7 @@ private void sanityCheck(Configuration conf, ResolvedSchema schema) {
 
     // validate pre_combine key
     String preCombineField = conf.get(FlinkOptions.PRECOMBINE_FIELD);
-    if (!fields.contains(preCombineField)) {
+    if (!fields.contains(getRootLevelFieldName(preCombineField))) {
       if (OptionsResolver.isDefaultHoodieRecordPayloadClazz(conf)) {

Review Comment:
   @danny0405 I have added the feature to validate nested fields for Flink. 
   
   Can you please help to review this PR again? 
   
   Thank you.
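   The fix routes nested paths through `getRootLevelFieldName` before the membership check; its behaviour can be sketched as a hypothetical Python analogue:

   ```python
   def get_root_level_field_name(field_path):
       """Return the top-level column of a possibly nested field, e.g. 'a.b.c' -> 'a'."""
       return field_path.split(".", 1)[0]

   # A nested precombine field now validates against the schema's top-level columns.
   fields = ["id", "name", "user"]
   pre_combine = "user.created_at"
   print(get_root_level_field_name(pre_combine) in fields)  # True
   ```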






[GitHub] [hudi] hudi-bot commented on pull request #7304: [HUDI-5278]support more conf to cluster procedure

2022-11-28 Thread GitBox


hudi-bot commented on PR #7304:
URL: https://github.com/apache/hudi/pull/7304#issuecomment-1330143941

   
   ## CI report:
   
   * 728191c9f32587415239c87cf89ce3efb6b42fb0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13301)
 
   * 20a511d8ed5d22d8dd6fda83399b45f8e947d9c7 UNKNOWN
   * 3f8f9a48fadb41bf0c9e6449be942035e7b2d4d7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7264: [HUDI-5253] HoodieMergeOnReadTableInputFormat could have duplicate records issue if it contains delta files while still splittable

2022-11-28 Thread GitBox


hudi-bot commented on PR #7264:
URL: https://github.com/apache/hudi/pull/7264#issuecomment-1330143854

   
   ## CI report:
   
   * 1b07f271c16acd022d7853df55bb0a838b8b30de Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13177)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13205)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13213)
 
   * d6167ff53f72e8b630dd794c9d1568026e58fba1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7196: [HUDI-5279] move logic for deleting active instant to HoodieActiveTimeline

2022-11-28 Thread GitBox


hudi-bot commented on PR #7196:
URL: https://github.com/apache/hudi/pull/7196#issuecomment-1330143705

   
   ## CI report:
   
   * 3b1e3982daaa1118a96a63b54fe70bbc922eb4e4 UNKNOWN
   * 43684b3d65f627d92b1bfa61699e7d61983fe95a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13279)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13295)
 
   * a7d0518c591b55bfb318bf505b62ba8d8737ecfe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6915: [HUDI-5004] Allow nested field as primary key and preCombineField in flink sql

2022-11-28 Thread GitBox


hudi-bot commented on PR #6915:
URL: https://github.com/apache/hudi/pull/6915#issuecomment-1330143346

   
   ## CI report:
   
   * 063adc8eb9514f64747f98b3cd1f28ddd30f430f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12120)
 
   * 7727723e5b09a86d950aaa20b1ac77187680e1f0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


hudi-bot commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330139435

   
   ## CI report:
   
   * 4f6c23b63efd79d7c9c488e232959623c8e346f5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13303)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


danny0405 commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330139477

   Thanks for the contribution, can you rebase with the latest master and 
resolve the compile error,
   can you also add a UT for the guard lock for metadata table in 
`TestStreamWriteOperatorCoordinator`.





[GitHub] [hudi] hudi-bot commented on pull request #7304: [HUDI-5278]support more conf to cluster procedure

2022-11-28 Thread GitBox


hudi-bot commented on PR #7304:
URL: https://github.com/apache/hudi/pull/7304#issuecomment-1330139340

   
   ## CI report:
   
   * e839fd6ee20e9586875e5666d9c79e177461ca13 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13284)
 
   * 728191c9f32587415239c87cf89ce3efb6b42fb0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13301)
 
   * 20a511d8ed5d22d8dd6fda83399b45f8e947d9c7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] boneanxs commented on a diff in pull request #7264: [HUDI-5253] HoodieMergeOnReadTableInputFormat could have duplicate records issue if it contains delta files while still splittable

2022-11-28 Thread GitBox


boneanxs commented on code in PR #7264:
URL: https://github.com/apache/hudi/pull/7264#discussion_r1034337202


##
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieMergeOnReadTableInputFormat.java:
##
@@ -65,4 +66,16 @@ void pathNotSplitableForBootstrapScenario() throws IOException {
     rtPath.setPathWithBootstrapFileStatus(path);
     assertFalse(new HoodieMergeOnReadTableInputFormat().isSplitable(fs, rtPath), "Path for bootstrap should not be splitable.");
   }
+
+  @Test
+  void pathNotSplitableIfContainsDeltaFiles() throws IOException {
+    URI basePath = Files.createTempFile(tempDir, "target", ".parquet").toUri();
+    HoodieRealtimePath rtPath = new HoodieRealtimePath(new Path("foo"), "bar", basePath.toString(), Collections.emptyList(), "000", false, Option.empty());
+    assertTrue(new HoodieMergeOnReadTableInputFormat().isSplitable(fs, rtPath));
+
+    URI logPath = Files.createTempFile(tempDir, ".test", ".log.4_1-149-180").toUri();
+    HoodieLogFile logFile = new HoodieLogFile(fs.getFileStatus(new Path(logPath)));
+    rtPath = new HoodieRealtimePath(new Path("foo"), "bar", basePath.toString(), Collections.singletonList(logFile), "000", false, Option.empty());
+    assertFalse(new HoodieMergeOnReadTableInputFormat().isSplitable(fs, rtPath), "Path for bootstrap should not be splitable.");
+  }

Review Comment:
   done






[hudi] branch master updated: [MINOR] Fix IncrementalInputSplits compilation failure (#7319)

2022-11-28 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d70df89fe [MINOR] Fix IncrementalInputSplits compilation failure (#7319)
0d70df89fe is described below

commit 0d70df89fe6b7049d576e2b9bf75afb29c75c46d
Author: Alexander Trushev 
AuthorDate: Tue Nov 29 13:03:57 2022 +0700

[MINOR] Fix IncrementalInputSplits compilation failure (#7319)
---
 .../src/main/java/org/apache/hudi/source/IncrementalInputSplits.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
index d12feb4e73..0a10c77c22 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
@@ -492,7 +492,7 @@ public class IncrementalInputSplits implements Serializable {
 
 if (OptionsResolver.hasNoSpecificReadCommits(this.conf)) {
   // by default read from the latest commit
-  List instants = completedTimeline.getInstants().collect(Collectors.toList());
+  List instants = completedTimeline.getInstants();
   if (instants.size() > 1) {
 return Collections.singletonList(instants.get(instants.size() - 1));
   }
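For context, the diff reflects that `getInstants()` on the completed timeline now returns a `List` directly, so the old `collect(Collectors.toList())` call no longer compiled. The latest-commit selection logic around the fixed line can be sketched in isolation (a minimal sketch with `String` standing in for the instant type; class and method names here are illustrative, not from the Hudi codebase):

```java
import java.util.Collections;
import java.util.List;

public class LatestInstantExample {

    // Sketch of the pattern in IncrementalInputSplits: when no specific read
    // commits are configured, keep only the latest completed instant.
    static List<String> latestOnly(List<String> instants) {
        if (instants.size() > 1) {
            return Collections.singletonList(instants.get(instants.size() - 1));
        }
        return instants;
    }

    public static void main(String[] args) {
        // With completed instants "001", "002", "003", only the latest survives.
        System.out.println(latestOnly(List.of("001", "002", "003")));
    }
}
```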



[GitHub] [hudi] danny0405 merged pull request #7319: [MINOR] Fix IncrementalInputSplits compilation failure

2022-11-28 Thread GitBox


danny0405 merged PR #7319:
URL: https://github.com/apache/hudi/pull/7319





[GitHub] [hudi] codope commented on issue #7294: [SUPPORT] Different keygen class assigned by Hudi in 0.11.1 and 0.12.1 while creating a table with multiple primary keys

2022-11-28 Thread GitBox


codope commented on issue #7294:
URL: https://github.com/apache/hudi/issues/7294#issuecomment-1330111435

   Attempted to reproduce with the script below (note that I created the table with 0.11.1 and did the insert with 0.12.1). The issue described by the OP is reproducible, but the upgrade doesn't override the keygen prop in hoodie.properties.
   ```
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceReadOptions
   import org.apache.hudi.QuickstartUtils._
   import org.apache.spark.sql.DataFrame
   
   spark.sql("""create table f_test111(
   id string,
   name string,
   age string,
   salary string,
   upd_ts string,
   job string
   )
   using hudi
   partitioned by (job)
   location 'file:///tmp/f_test111/'
   options (
   type = 'cow',
   primaryKey = 'id,name',
   preCombineField = 'upd_ts'
   )""")
   
   spark.sql("""insert into f_test111 values('a1', 'sagar', '32', '1000', '100', 'se')""")
   ```





[GitHub] [hudi] KnightChess commented on pull request #7304: [HUDI-5278]support more conf to cluster procedure

2022-11-28 Thread GitBox


KnightChess commented on PR #7304:
URL: https://github.com/apache/hudi/pull/7304#issuecomment-1330087868

   I will re-run CI after #7319 is merged.





[GitHub] [hudi] hudi-bot commented on pull request #7320: [HUDI-5290] remove the lock in #writeTableMetadata

2022-11-28 Thread GitBox


hudi-bot commented on PR #7320:
URL: https://github.com/apache/hudi/pull/7320#issuecomment-1330087591

   
   ## CI report:
   
   * 4f6c23b63efd79d7c9c488e232959623c8e346f5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7319: [MINOR] Fix IncrementalInputSplits compilation failure

2022-11-28 Thread GitBox


hudi-bot commented on PR #7319:
URL: https://github.com/apache/hudi/pull/7319#issuecomment-1330087564

   
   ## CI report:
   
   * 6be36431d92312fa1a4ea0e55e9cb85b9083d94f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] just-JL commented on issue #7314: [SUPPORT] Caused by: java.lang.IllegalArgumentException: ALREADY_ACQUIRED

2022-11-28 Thread GitBox


just-JL commented on issue #7314:
URL: https://github.com/apache/hudi/issues/7314#issuecomment-1330085896

   @danny0405 Please take a look at PR #7320. If there is a problem, I will update it.




