[jira] [Closed] (HUDI-5656) Metadata Bootstrap flow resulting in NPE

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5656.
-
Resolution: Fixed

> Metadata Bootstrap flow resulting in NPE
> 
>
> Key: HUDI-5656
> URL: https://issues.apache.org/jira/browse/HUDI-5656
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> After adding a simple statement forcing the test to read the whole 
> bootstrapped table:
> {code:java}
> sqlContext.sql("select * from bootstrapped").show(); {code}
>  
> The following NPE has been observed on master 
> (testBulkInsertsAndUpsertsWithBootstrap):
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 183.0 failed 1 times, most recent failure: Lost task 0.0 in stage 183.0 
> (TID 971, localhost, executor driver): java.lang.NullPointerException
>     at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
>     at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
>     at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>     at org.apache.s

[jira] [Updated] (HUDI-5656) Metadata Bootstrap flow resulting in NPE

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5656:
--
Fix Version/s: 0.13.1
   (was: 0.14.0)

> Metadata Bootstrap flow resulting in NPE
> 
>
> Key: HUDI-5656
> URL: https://issues.apache.org/jira/browse/HUDI-5656
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> After adding a simple statement forcing the test to read the whole 
> bootstrapped table:
> {code:java}
> sqlContext.sql("select * from bootstrapped").show(); {code}
>  
> The following NPE has been observed on master 
> (testBulkInsertsAndUpsertsWithBootstrap):
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 183.0 failed 1 times, most recent failure: Lost task 0.0 in stage 183.0 
> (TID 971, localhost, executor driver): java.lang.NullPointerException
>     at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
>     at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
>     at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCo

[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-915:
-
Status: Patch Available  (was: In Progress)

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.13.0
>Reporter: Udit Mehrotra
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also Spark's default behavior when it writes data. 
> With this partitioning, the partition column/schema is never stored in the 
> files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_* (see the 
> sketch at the end of this description).
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we are computing reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus the *upserted rows* end up with the 
> correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, are missing the 
> partition column. We therefore observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not creating issues with *Hive* at the moment, because Hive can 
> determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file, instead of having to determine whether the 
> partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upserts on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again, this is consistent behavior with non-bootstrapped tables, and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this, it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the 
> *parquet datasource* will return null values wherever it finds missing 
> partition values. In that case, we have no control over the *parquet* 
> datasource as it is simply reading from the file.
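> As a minimal illustration (not from this ticket; paths and session setup are 
> assumptions), hive-style partitioning pushes the partition column out of the 
> data files and into directory names, which is the gap described above:
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
> import spark.implicits._
> 
> // partitionBy() encodes 'city' in the path, e.g. .../city=SF/part-....parquet
> Seq((1, "a1", "SF"), (2, "a2", "NY"))
>   .toDF("id", "name", "city")
>   .write
>   .partitionBy("city")
>   .parquet("/tmp/hive_style_demo")
> 
> // Spark reconstructs 'city' from the paths on read; a reader that sees only
> // the file contents (as after metadata bootstrap) finds no 'city' column
> spark.read.parquet("/tmp/hive_style_demo").printSchema() {code}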





[jira] [Closed] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-915.

Resolution: Fixed

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.13.0
>Reporter: Udit Mehrotra
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also Spark's default behavior when it writes data. 
> With this partitioning, the partition column/schema is never stored in the 
> files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we are computing reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus the *upserted rows* end up with the 
> correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, are missing the 
> partition column. We therefore observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not creating issues with *Hive* at the moment, because Hive can 
> determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file, instead of having to determine whether the 
> partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upserts on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again, this is consistent behavior with non-bootstrapped tables, and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this, it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the 
> *parquet datasource* will return null values wherever it finds missing 
> partition values. In that case, we have no control over the *parquet* 
> datasource as it is simply reading from the file.





[jira] [Assigned] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-915:


Assignee: Alexey Kudinkin  (was: Ethan Guo)

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also Spark's default behavior when it writes data. 
> With this partitioning, the partition column/schema is never stored in the 
> files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we are computing reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus the *upserted rows* end up with the 
> correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, are missing the 
> partition column. We therefore observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not creating issues with *Hive* at the moment, because Hive can 
> determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file, instead of having to determine whether the 
> partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upserts on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again, this is consistent behavior with non-bootstrapped tables, and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this, it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the 
> *parquet datasource* will return null values wherever it finds missing 
> partition values. In that case, we have no control over the *parquet* 
> datasource as it is simply reading from the file.





[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-915:
-
Affects Version/s: 0.13.0
   (was: 0.9.0)

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.13.0
>Reporter: Udit Mehrotra
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also Spark's default behavior when it writes data. 
> With this partitioning, the partition column/schema is never stored in the 
> files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we are computing reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus the *upserted rows* end up with the 
> correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, are missing the 
> partition column. We therefore observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not creating issues with *Hive* at the moment, because Hive can 
> determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file, instead of having to determine whether the 
> partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upserts on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again, this is consistent behavior with non-bootstrapped tables, and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this, it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the 
> *parquet datasource* will return null values wherever it finds missing 
> partition values. In that case, we have no control over the *parquet* 
> datasource as it is simply reading from the file.





[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2023-02-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-915:
-
Issue Type: Bug  (was: Task)

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_, which is also Spark's default behavior when it writes data. 
> With this partitioning, the partition column/schema is never stored in the 
> files but instead retrieved on the fly from the file paths, which have 
> partition folders in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the Hudi 
> table folder. Also, the *bootstrap schema* we are computing reads the schema 
> directly from the source data file, which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests into issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it for performing upserts. Thus the *upserted rows* end up with the 
> correct partition column value stored, while the other records, which are 
> simply copied over from the metadata bootstrap file, are missing the 
> partition column. We therefore observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this is not creating issues with *Hive* at the moment, because Hive can 
> determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file, instead of having to determine whether the 
> partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upserts on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again, this is consistent behavior with non-bootstrapped tables, and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this, it will be significantly more complicated to provide the 
> partition value on the read side in Spark, having to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the 
> *parquet datasource* will return null values wherever it finds missing 
> partition values. In that case, we have no control over the *parquet* 
> datasource as it is simply reading from the file.





[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5835:
--
Fix Version/s: 0.13.1

> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Assignee: Tao Meng
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> The Avro schema created by Spark SQL misses the Avro record name and 
> namespace. This causes the read schema and write schema of the log file to be 
> incompatible (a minimal illustration follows the snippet below).
>  
> {code:java}
> // code placeholder
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long,
>   |  ff decimal(38, 10)
>   |) using hudi
>   | location '${tablePath.toString}'
>   | tblproperties (
>   |  type = 'mor',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
> checkAnswer(s"select id, name, price, ts from $tableName")(
>   Seq(1, "a1", 10.0, 1000)
> )
> spark.sql(s"update $tableName set price = 22 where id = 1")
> checkAnswer(s"select id, name, price, ts from $tableName")(    failed
>   Seq(1, "a1", 22.0, 1000)
> )
> {code}
>  
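> As a hedged sketch using plain Avro APIs (editor's illustration; the record 
> names are made up), a missing record name/namespace makes otherwise identical 
> read and write schemas irreconcilable under Avro schema resolution:
> {code:java}
> import org.apache.avro.{SchemaBuilder, SchemaCompatibility}
> 
> // Writer schema with an explicit record name and namespace
> val named = SchemaBuilder.record("tableName_record").namespace("hoodie.tableName")
>   .fields().requiredInt("id").requiredString("name").endRecord()
> 
> // Schema derived without name hints: same fields, different record name
> val unnamed = SchemaBuilder.record("record")
>   .fields().requiredInt("id").requiredString("name").endRecord()
> 
> // Avro requires matching record names, so this reports INCOMPATIBLE
> // even though every field lines up
> println(SchemaCompatibility.checkReaderWriterCompatibility(named, unnamed).getType) {code}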





[jira] [Commented] (HUDI-5641) Streamline Advanced Schema Evolution flow

2023-02-22 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692465#comment-17692465
 ] 

Alexey Kudinkin commented on HUDI-5641:
---

[~xushiyan] good catch! It's actually

> Streamline Advanced Schema Evolution flow
> -
>
> Key: HUDI-5641
> URL: https://issues.apache.org/jira/browse/HUDI-5641
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Currently, Schema Evolution is not always applied consistently and is 
> sometimes re-applied multiple times, causing issues for HoodieSparkRecord 
> implementations (which are optimized to reuse the underlying buffer), as the 
> sketch below illustrates:
>  # HoodieMergeHelper would apply the SE transformer, then
>  # HoodieMergeHandle would run rewriteRecordWithNewSchema again
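> As a minimal sketch (illustrative only, not Hudi code), applying a positional 
> schema projection twice scrambles a buffer-reusing record, which is the 
> failure mode described above:
> {code:java}
> // new field i is taken from old position mapping(i)
> val mapping = Array(2, 0, 1)
> def rewrite(row: Array[Any]): Array[Any] = mapping.map(i => row(i))
> 
> val row   = Array[Any]("a", "b", "c")
> val once  = rewrite(row)   // correct result: c, a, b
> val twice = rewrite(once)  // double application: b, c, a -- corrupted
> assert(!(twice sameElements once)) {code}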





[jira] [Closed] (HUDI-5641) Streamline Advanced Schema Evolution flow

2023-02-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5641.
-
Resolution: Fixed

> Streamline Advanced Schema Evolution flow
> -
>
> Key: HUDI-5641
> URL: https://issues.apache.org/jira/browse/HUDI-5641
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Currently, Schema Evolution is not always applied consistently and is 
> sometimes re-applied multiple times, causing issues for HoodieSparkRecord 
> implementations (which are optimized to reuse the underlying buffer):
>  # HoodieMergeHelper would apply the SE transformer, then
>  # HoodieMergeHandle would run rewriteRecordWithNewSchema again





[jira] [Assigned] (HUDI-5514) Add support for auto generation of record keys for Hudi

2023-02-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5514:
-

Assignee: Alexey Kudinkin  (was: sivabalan narayanan)

> Add support for auto generation of record keys for Hudi
> ---
>
> Key: HUDI-5514
> URL: https://issues.apache.org/jira/browse/HUDI-5514
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Hudi requires record keys to be set for any given table. But for some 
> use-cases, like ingesting log events, users may not have any column that can 
> act as a primary key field. So, it would be good to support auto-generation 
> of record keys internally within Hudi for such immutable use-cases. 
>  
>  





[jira] [Closed] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5557.
-
Resolution: Fixed

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.1
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> Suppose the Hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from indexed fields and the other 
> part comes from non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties:
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and SQL:
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.
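> As a hedged sketch (editor's illustration, not Hudi's actual pruning code), 
> the required behavior is that a predicate on a column without stats in the 
> metadata table must keep the file ("maybe matches"), never skip it:
> {code:java}
> case class ColStats(min: String, max: String)
> 
> // Returns true if the file MIGHT contain matching rows
> def mightContain(stats: Map[String, ColStats], col: String, v: String): Boolean =
>   stats.get(col) match {
>     case Some(s) => s.min <= v && v <= s.max  // prune using the column index
>     case None    => true                      // un-indexed column: keep the file
>   }
> 
> // 'name' is indexed, 'age' is not; the conjunction must still keep the file
> val fileStats = Map("name" -> ColStats("alice", "zoe"))
> val keep = mightContain(fileStats, "name", "tom") && mightContain(fileStats, "age", "18")
> assert(keep)  // pruning the file away here is exactly the reported bug {code}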





[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5557:
--
Affects Version/s: 0.12.2
   (was: 0.12.1)

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.2
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> Suppose the Hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from indexed fields and the other 
> part comes from non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties:
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and SQL:
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.





[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5557:
--
Fix Version/s: 0.12.3

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.1
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> Suppose the Hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from indexed fields and the other 
> part comes from non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties:
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and SQL:
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.





[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5557:
--
Priority: Blocker  (was: Critical)

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.1
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Suppose the Hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from indexed fields and the other 
> part comes from non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties:
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and SQL:
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.





[jira] [Updated] (HUDI-5815) Investigate flaky tests

2023-02-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5815:
--
Priority: Blocker  (was: Major)

> Investigate flaky tests
> ---
>
> Key: HUDI-5815
> URL: https://issues.apache.org/jira/browse/HUDI-5815
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> TestHoodieDeltaStreamer.testHoodieAsyncClusteringJobWithScheduleAndExecute
> TestHoodieDeltaStreamer.testAsyncClusteringServiceWithCompaction
> TestHoodieDeltaStreamer.testUpsertsMORContinuousMode
> TestHoodieDeltaStreamer.testHoodieIndexer





[jira] [Updated] (HUDI-5815) Investigate flaky tests

2023-02-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5815:
--
Fix Version/s: 0.13.1
   (was: 0.13.0)

> Investigate flaky tests
> ---
>
> Key: HUDI-5815
> URL: https://issues.apache.org/jira/browse/HUDI-5815
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> TestHoodieDeltaStreamer.testHoodieAsyncClusteringJobWithScheduleAndExecute
> TestHoodieDeltaStreamer.testAsyncClusteringServiceWithCompaction
> TestHoodieDeltaStreamer.testUpsertsMORContinuousMode
> TestHoodieDeltaStreamer.testHoodieIndexer





[jira] [Closed] (HUDI-5745) 0.13.0 release note part 3

2023-02-15 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5745.
-
Resolution: Fixed

> 0.13.0 release note part 3
> --
>
> Key: HUDI-5745
> URL: https://issues.apache.org/jira/browse/HUDI-5745
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Migration Guide -> Breaking Changes -> Lazy File Index in Spark
> Migration Guide -> Breaking Changes -> Log4j Configuration
> Migration Guide -> Behavior Changes -> Simple Write Executor as Default
> Release Highlights -> Optimizing Record Payload handling
> Release Highlights -> Lazy File Index in Spark





[jira] [Closed] (HUDI-5602) Troubleshoot METADATA_ONLY bootstrapped table not being able to read back partition path

2023-02-15 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5602.
-
Resolution: Fixed

> Troubleshoot METADATA_ONLY bootstrapped table not being able to read back 
> partition path
> 
>
> Key: HUDI-5602
> URL: https://issues.apache.org/jira/browse/HUDI-5602
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.1
>
>
> In [https://github.com/apache/hudi/pull/7461], after enabling matching of the 
> whole payload rather than just record counts, it was discovered that Hudi 
> isn't able to read back the partition path after running METADATA_ONLY 
> bootstrap, leading to a test failure (annotated with a TODO and this Jira in 
> the test suite). A sketch of that assertion style follows.
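> As a hedged sketch (editor's illustration; the paths, column names, and an 
> active spark session are assumptions), comparing whole payloads catches a 
> null partition path that a count-only check would silently pass:
> {code:java}
> import org.apache.spark.sql.functions.col
> 
> val cols = Seq("id", "name", "partition_path")    // illustrative columns
> val expected = spark.read.parquet("/tmp/source")  // pre-bootstrap source data
>   .select(cols.map(col): _*).orderBy("id").collect().toSeq
> val actual = spark.read.format("hudi").load("/tmp/hudi_table")
>   .select(cols.map(col): _*).orderBy("id").collect().toSeq
> 
> // A count-only check (actual.size == expected.size) passes even when the
> // partition path comes back null; whole-row equality does not
> assert(actual == expected) {code}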





[jira] [Commented] (HUDI-5807) HoodieSparkParquetReader is not appending partition-path values

2023-02-15 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689276#comment-17689276
 ] 

Alexey Kudinkin commented on HUDI-5807:
---

We should do this by rebasing HoodieSparkFileReader onto ParquetFileFormat (to 
make sure we're creating readers the same way as Spark itself does):
{code:java}
val parquetFileFormat = SparkAdapterSupport$.MODULE$.sparkAdapter()
  // TODO: this should be based on the table config
  .createHoodieParquetFileFormat(true)
  .get() {code}

> HoodieSparkParquetReader is not appending partition-path values
> ---
>
> Key: HUDI-5807
> URL: https://issues.apache.org/jira/browse/HUDI-5807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The current implementation of HoodieSparkParquetReader doesn't support the 
> case when "hoodie.datasource.write.drop.partition.columns" is set to true.
> In that case, partition-path values are expected to be parsed from the 
> partition path and injected within the File Reader (this is the behavior of 
> Spark's own readers).





[jira] [Updated] (HUDI-5807) HoodieSparkParquetReader is not appending partition-path values

2023-02-15 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5807:
--
Affects Version/s: 0.13.0

> HoodieSparkParquetReader is not appending partition-path values
> ---
>
> Key: HUDI-5807
> URL: https://issues.apache.org/jira/browse/HUDI-5807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The current implementation of HoodieSparkParquetReader doesn't support the 
> case when "hoodie.datasource.write.drop.partition.columns" is set to true.
> In that case, partition-path values are expected to be parsed from the 
> partition path and injected within the File Reader (this is the behavior of 
> Spark's own readers).





[jira] [Created] (HUDI-5807) HoodieSparkParquetReader is not appending partition-path values

2023-02-15 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5807:
-

 Summary: HoodieSparkParquetReader is not appending partition-path 
values
 Key: HUDI-5807
 URL: https://issues.apache.org/jira/browse/HUDI-5807
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Reporter: Alexey Kudinkin
 Fix For: 0.13.1


The current implementation of HoodieSparkParquetReader doesn't support the case 
when "hoodie.datasource.write.drop.partition.columns" is set to true.

In that case, partition-path values are expected to be parsed from the 
partition path and injected within the File Reader (this is the behavior of 
Spark's own readers), as sketched below.
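
As a hedged sketch (an illustrative helper, not the actual reader code), the 
expected injection recovers partition values from the hive-style path segments 
and appends them to each row the file reader returns:

{code:java}
// Recover partition values from a hive-style relative path
def partitionValues(relativePath: String): Map[String, String] =
  relativePath.split("/").toSeq
    .filter(_.contains("="))
    .map { seg =>
      val Array(k, v) = seg.split("=", 2)
      k -> v
    }.toMap

// e.g. a file under city=SF/ yields city=SF even if the column was dropped
assert(partitionValues("city=SF/part-0001.parquet") == Map("city" -> "SF")) {code}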





[jira] [Assigned] (HUDI-5766) Flaky TestHoodieDeltaStreamer.testHoodieIndexer

2023-02-10 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5766:
-

Assignee: Sagar Sumit

> Flaky TestHoodieDeltaStreamer.testHoodieIndexer
> ---
>
> Key: HUDI-5766
> URL: https://issues.apache.org/jira/browse/HUDI-5766
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Major
>
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15085&view=logs&j=9273dbcb-a208-5cd0-5541-8b9925cb3da0&t=55ef67e2-6f87-5746-60ad-7ddc97e935ed&s=ee3800fd-6e81-525f-e564-94108585217d]
>  
>  
> {code:java}
> [ERROR] testHoodieIndexer{HoodieRecordType}[2] Time elapsed: 75.513 s <<< 
> ERROR! 2023-02-10T17:32:46.5322187Z java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieException 
> 2023-02-10T17:32:46.5322924Z at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122) 
> 2023-02-10T17:32:46.5323614Z at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) 
> 2023-02-10T17:32:46.5324413Z at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:901)
>  2023-02-10T17:32:46.5325628Z at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:884)
>  2023-02-10T17:32:46.5326567Z at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:929)
>  2023-02-10T17:32:46.5327491Z at 
> org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.testHoodieIndexer(TestHoodieDeltaStreamer.java:1163)
>  2023-02-10T17:32:46.5328245Z at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> 2023-02-10T17:32:46.5328902Z at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> 2023-02-10T17:32:46.5329661Z at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  2023-02-10T17:32:46.5330521Z at 
> java.lang.reflect.Method.invoke(Method.java:498) 2023-02-10T17:32:46.5331200Z 
> at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>  2023-02-10T17:32:46.5331985Z at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>  2023-02-10T17:32:46.5332917Z at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>  2023-02-10T17:32:46.5333787Z at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>  2023-02-10T17:32:46.5334594Z at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>  2023-02-10T17:32:46.5335465Z at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
>  2023-02-10T17:32:46.5336380Z at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>  2023-02-10T17:32:46.5337275Z at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>  2023-02-10T17:32:46.5338159Z at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>  2023-02-10T17:32:46.5339082Z at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>  2023-02-10T17:32:46.5340916Z at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>  2023-02-10T17:32:46.5341867Z at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>  2023-02-10T17:32:46.5343186Z at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>  2023-02-10T17:32:46.5344979Z at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>  2023-02-10T17:32:46.5345841Z at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
>  2023-02-10T17:32:46.5346692Z at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>  2023-02-10T17:32:46.5347498Z at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
>  2023-02-10T17:32:46.5348309Z at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
>  2023-02-10T17:32:46.5349095Z at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescript

[jira] [Created] (HUDI-5766) Flaky TestHoodieDeltaStreamer.testHoodieIndexer

2023-02-10 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5766:
-

 Summary: Flaky TestHoodieDeltaStreamer.testHoodieIndexer
 Key: HUDI-5766
 URL: https://issues.apache.org/jira/browse/HUDI-5766
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin


[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15085&view=logs&j=9273dbcb-a208-5cd0-5541-8b9925cb3da0&t=55ef67e2-6f87-5746-60ad-7ddc97e935ed&s=ee3800fd-6e81-525f-e564-94108585217d]

 

 
{code:java}
[ERROR] testHoodieIndexer{HoodieRecordType}[2] Time elapsed: 75.513 s <<< 
ERROR! 2023-02-10T17:32:46.5322187Z java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: org.apache.hudi.exception.HoodieException 
2023-02-10T17:32:46.5322924Z at 
java.util.concurrent.FutureTask.report(FutureTask.java:122) 
2023-02-10T17:32:46.5323614Z at 
java.util.concurrent.FutureTask.get(FutureTask.java:192) 
2023-02-10T17:32:46.5324413Z at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:901)
 2023-02-10T17:32:46.5325628Z at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:884)
 2023-02-10T17:32:46.5326567Z at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:929)
 2023-02-10T17:32:46.5327491Z at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.testHoodieIndexer(TestHoodieDeltaStreamer.java:1163)
 2023-02-10T17:32:46.5328245Z at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
2023-02-10T17:32:46.5328902Z at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
2023-02-10T17:32:46.5329661Z at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 2023-02-10T17:32:46.5330521Z at 
java.lang.reflect.Method.invoke(Method.java:498) 2023-02-10T17:32:46.5331200Z 
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
 2023-02-10T17:32:46.5331985Z at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
 2023-02-10T17:32:46.5332917Z at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
 2023-02-10T17:32:46.5333787Z at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
 2023-02-10T17:32:46.5334594Z at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
 2023-02-10T17:32:46.5335465Z at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
 2023-02-10T17:32:46.5336380Z at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
 2023-02-10T17:32:46.5337275Z at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
 2023-02-10T17:32:46.5338159Z at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
 2023-02-10T17:32:46.5339082Z at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
 2023-02-10T17:32:46.5340916Z at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
 2023-02-10T17:32:46.5341867Z at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
 2023-02-10T17:32:46.5343186Z at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
 2023-02-10T17:32:46.5344979Z at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
 2023-02-10T17:32:46.5345841Z at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
 2023-02-10T17:32:46.5346692Z at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
 2023-02-10T17:32:46.5347498Z at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
 2023-02-10T17:32:46.5348309Z at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
 2023-02-10T17:32:46.5349095Z at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
 2023-02-10T17:32:46.5349896Z at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
 2023-02-10T17:32:46.5350679Z at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
 2023-02-10T17:32:46.5351476Z at 
org.junit.platform.engine.s
{code}

[jira] [Assigned] (HUDI-5765) Flaky TestHoodieCompactor.testSpillingWhenCompaction

2023-02-10 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5765:
-

Assignee: Danny Chen

> Flaky TestHoodieCompactor.testSpillingWhenCompaction
> 
>
> Key: HUDI-5765
> URL: https://issues.apache.org/jira/browse/HUDI-5765
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Danny Chen
>Priority: Major
>
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15085&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&s=859b8d9a-8fd6-5a5c-6f5e-f84f1990894e]
>  
> {code:java}
> [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 37.956 s <<< FAILURE! - in 
> org.apache.hudi.table.action.compact.TestHoodieCompactor
> [ERROR] testSpillingWhenCompaction  Time elapsed: 16.132 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: There should be 1 log file written for 
> every data file ==> expected: <1> but was: <0>
>   at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
>   at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
>   at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:166)
>   at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:643)
>   at 
> org.apache.hudi.table.action.compact.TestHoodieCompactor.assertLogFilesNumEqualsTo(TestHoodieCompactor.java:246)
>   at 
> org.apache.hudi.table.action.compact.TestHoodieCompactor.testSpillingWhenCompaction(TestHoodieCompactor.java:208)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>   at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>   at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
>   at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
>   at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
>   at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>   at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
>   at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
>   at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (HUDI-5765) Flaky TestHoodieCompactor.testSpillingWhenCompaction

2023-02-10 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5765:
-

 Summary: Flaky TestHoodieCompactor.testSpillingWhenCompaction
 Key: HUDI-5765
 URL: https://issues.apache.org/jira/browse/HUDI-5765
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin


[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15085&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&s=859b8d9a-8fd6-5a5c-6f5e-f84f1990894e]

 
{code:java}
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.956 
s <<< FAILURE! - in org.apache.hudi.table.action.compact.TestHoodieCompactor
[ERROR] testSpillingWhenCompaction  Time elapsed: 16.132 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: There should be 1 log file written for 
every data file ==> expected: <1> but was: <0>
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
at 
org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
at 
org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:166)
at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:643)
at 
org.apache.hudi.table.action.compact.TestHoodieCompactor.assertLogFilesNumEqualsTo(TestHoodieCompactor.java:246)
at 
org.apache.hudi.table.action.compact.TestHoodieCompactor.testSpillingWhenCompaction(TestHoodieCompactor.java:208)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
at 
org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5759) Hudi do not support add column on mor table with log

2023-02-10 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5759:
--
Priority: Blocker  (was: Major)

> Hudi do not support add column on mor table with log
> 
>
> Key: HUDI-5759
> URL: https://issues.apache.org/jira/browse/HUDI-5759
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> We tested the following SQL statements on the latest master branch:
> {code:sql}
> create table h0 (
>   id int,
>   name string,
>   price double,
>   ts long
> ) using hudi
>  options (
>   primaryKey ='id',
>   type = 'mor',
>   preCombineField = 'ts'
>  )
>  partitioned by(ts)
>  location '/tmp/h0';
> insert into h0 select 1, 'a1', 10, 1000;
> update h0 set price = 20 where id = 1;
> alter table h0 add column new_col1 int;
> update h0 set price = 22 where id = 1;
> select * from h0;
> {code}
> We found that the table can no longer be read after the column was added and a subsequent update was issued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5759) Hudi do not support add column on mor table with log

2023-02-10 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5759:
--
Fix Version/s: 0.13.1

> Hudi do not support add column on mor table with log
> 
>
> Key: HUDI-5759
> URL: https://issues.apache.org/jira/browse/HUDI-5759
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> We tested the following SQL statements on the latest master branch:
> {code:sql}
> create table h0 (
>   id int,
>   name string,
>   price double,
>   ts long
> ) using hudi
>  options (
>   primaryKey ='id',
>   type = 'mor',
>   preCombineField = 'ts'
>  )
>  partitioned by(ts)
>  location '/tmp/h0';
> insert into h0 select 1, 'a1', 10, 1000;
> update h0 set price = 20 where id = 1;
> alter table h0 add column new_col1 int;
> update h0 set price = 22 where id = 1;
> select * from h0;
> {code}
> We found that the table can no longer be read after the column was added and a subsequent update was issued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5760) Make sure DeleteBlock doesn't use Kryo for serialization to disk

2023-02-09 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5760:
-

 Summary: Make sure DeleteBlock doesn't use Kryo for serialization 
to disk
 Key: HUDI-5760
 URL: https://issues.apache.org/jira/browse/HUDI-5760
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Affects Versions: 1.0.0
Reporter: Alexey Kudinkin


The problem is that the serialization layout of `HoodieDeleteBlock` is generated 
dynamically by Kryo, and could therefore change whenever any class comprising it changes.

We've been bitten by this already twice:

HUDI-5758

HUDI-4959

 

Instead, anything that is persisted on disk has to be serialized using 
hard-coded methods (the same way `HoodieDataBlock`s are serialized).
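
For contrast, here is a minimal sketch of what such a hard-coded, versioned on-disk format could look like; the field set and layout below are assumptions for illustration, not the actual `HoodieDeleteBlock` format:
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch only: every byte is written by explicit, versioned code,
// so the on-disk layout can only change when this method is deliberately
// edited. A Kryo-generated layout, by contrast, shifts whenever any class it
// references changes.
final class DeleteBlockSerializerSketch {
  static final int FORMAT_VERSION = 1;

  static byte[] serialize(String[] recordKeysToDelete) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bos)) {
      out.writeInt(FORMAT_VERSION);          // readers gate on the version first
      out.writeInt(recordKeysToDelete.length);
      for (String key : recordKeysToDelete) {
        out.writeUTF(key);                   // fixed, self-describing encoding
      }
    }
    return bos.toByteArray();
  }
}
{code}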



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5758) MOR table w/ delete block in 0.12.2 not readable in 0.13 and also not compactable

2023-02-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5758:
-

Assignee: Alexey Kudinkin

> MOR table w/ delete block in 0.12.2 not readable in 0.13 and also not 
> compactable
> -
>
> Key: HUDI-5758
> URL: https://issues.apache.org/jira/browse/HUDI-5758
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.13.0
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> If a MOR table written with Hudi 0.12.2 has a delete block among its log blocks, 
> reading it from 0.13.0 fails due to Kryo serialization/deserialization issues. In 
> the same vein, compaction does not work either. 
>  
> Set of users who might be impacted by this:
> those who are using MOR tables and have 
> uncompacted file groups containing delete blocks. 
> Delete blocks are possible only in the following scenarios:
> a. Delete operations
> b. GLOBAL_INDEX with update-partition-path = true, which may result 
> in delete blocks. 
>  
> Root cause:
> HoodieKey was made KryoSerializable as part of RFC-46, but it appears the class 
> was never registered with Kryo.
>  
> {code:java}
>  spark.sql("select * from hudi_trips_snapshot ").show(100, false)
> 23/02/09 16:53:43 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> 19:02  WARN: [kryo] Unable to load class 7e51db6-6033-4794-ac59-44a930424b2b 
> with kryo's ClassLoader. Retrying with current..
> 23/02/09 16:53:44 ERROR AbstractHoodieLogRecordReader: Got exception when 
> reading log file
> com.esotericsoftware.kryo.KryoException: Unable to find class: 
> 7e51db6-6033-4794-ac59-44a930424b2b
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.DeleteRecord)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:100)
>   at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:74)
>   at 
> org.apache.hudi.common.table.log.block.HoodieDeleteBlock.deserialize(HoodieDeleteBlock.java:106)
>   at 
> org.apache.hudi.common.table.log.block.HoodieDeleteBlock.getRecordsToDelete(HoodieDeleteBlock.java:91)
>   at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:675)
>   at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:367)
>   at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223)
>   at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:198)
>   at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:114)
>   at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:73)
>   at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:464)
>   at org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:326)
>   at org.apache.hudi.LogFileIterator.(Iterators.scala:91)
>   at org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:172)
>   at 
> org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:100)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala
> {code}
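> 
> The failure shape can be reproduced in isolation. Below is a hedged, standalone sketch (this is not Hudi's SerializationUtils) of how a writer and a reader with mismatched Kryo registrations break class resolution:
> {code:java}
> import com.esotericsoftware.kryo.Kryo;
> import com.esotericsoftware.kryo.io.Input;
> import com.esotericsoftware.kryo.io.Output;
> import java.io.ByteArrayOutputStream;
> 
> public class KryoRegistrationMismatchSketch {
>   public static void main(String[] args) {
>     // The writer registers String[], so the class is encoded as a compact
>     // numeric id instead of its name.
>     Kryo writer = new Kryo();
>     writer.register(String[].class);
> 
>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>     try (Output out = new Output(bos)) {
>       writer.writeClassAndObject(out, new String[] {"record-key-1"});
>     }
> 
>     // The reader is missing that registration, so the id no longer resolves
>     // and deserialization typically fails with a KryoException, the same
>     // failure shape as in the log above.
>     Kryo reader = new Kryo();
>     try (Input in = new Input(bos.toByteArray())) {
>       reader.readClassAndObject(in);
>     }
>   }
> }
> {code}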

[jira] [Closed] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5731.
-
Resolution: Fixed

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}
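> 
> One quick way to verify a fix from an application running on top of the bundle (an illustrative probe, not part of Hudi) is to check both the shaded and the original class names:
> {code:java}
> // Hypothetical standalone probe: with the relocation removed, the shaded name
> // should be absent, and the original name should resolve only when Guava
> // itself is actually on the application classpath.
> public class ShadingProbe {
>   public static void main(String[] args) {
>     probe("org.apache.hudi.com.google.common.base.Preconditions");
>     probe("com.google.common.base.Preconditions");
>   }
> 
>   static void probe(String className) {
>     try {
>       Class.forName(className);
>       System.out.println(className + " -> present");
>     } catch (ClassNotFoundException e) {
>       System.out.println(className + " -> absent");
>     }
>   }
> }
> {code}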



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Sprint: Sprint 2023-01-31

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Status: Patch Available  (was: In Progress)

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Status: In Progress  (was: Open)

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Description: 
As originally reported in:

[https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]

 

The issue is that, after the removal of Guava, we still kept the following 
relocation configs in the MR/Spark bundles:
{code:xml}
<relocation>
  <pattern>com.google.common.</pattern>
  <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
</relocation>
{code}
This in turn meant that Guava references in any class would still be rewritten 
to the shaded package, even though Hudi isn't packaging Guava anymore. This 
might result in the following exception:
{code:java}
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hudi/com/google/common/base/Preconditions
at 
org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
at 
org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
at 
org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
at 
org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
... 45 more {code}

  was:Guava relocation is configured in the Spark and MR bundle pom.xml, but there 
is no Guava dependency, resulting in failures to load Guava-related classes.


> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Fix Version/s: 0.13.1

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue is that, after the removal of Guava, we still kept the following 
> relocation configs in the MR/Spark bundles:
> {code:xml}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> This in turn meant that Guava references in any class would still be rewritten 
> to the shaded package, even though Hudi isn't packaging Guava anymore. This 
> might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5731:
--
Summary: Fix com.google.common classes still being relocated in Hudi Spark 
bundle  (was: Add guava dependency to Spark and MR bundle)

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Priority: Critical
>  Labels: pull-request-available
>
> Guava relocation is configured in the Spark and MR bundle pom.xml, but there is 
> no Guava dependency, resulting in failures to load Guava-related classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-02-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5731:
-

Assignee: Alexey Kudinkin

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
>
> Guava relocation is configured in the Spark and MR bundle pom.xml, but there is 
> no Guava dependency, resulting in failures to load Guava-related classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-07 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5557:
--
Fix Version/s: 0.13.1

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.1
>Reporter: ruofan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Suppose a Hudi table has five fields, but only two of them are indexed. When 
> part of the filter condition in a SQL query refers to indexed fields and the 
> other part refers to non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> the table properties
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and the SQL query
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> with hoodie.enable.data.skipping=false the data can be found, but with 
> hoodie.enable.data.skipping=true the expected data is not returned.
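> 
> The rule the pruning has to follow in a conjunctive filter: only predicates on indexed columns may eliminate a file, while predicates on non-indexed columns are inconclusive and must keep the file as a candidate. A minimal sketch of that semantics, with hypothetical types rather than Hudi's actual data-skipping classes:
> {code:java}
> import java.util.List;
> import java.util.Map;
> 
> // Hypothetical, simplified model of per-file column stats (not Hudi's API).
> class PruningSketch {
>   record FileStats(Map<String, Integer> min, Map<String, Integer> max) {}
>   record EqPredicate(String column, int value) {}
> 
>   // A file may be dropped only when stats on an INDEXED column prove that no
>   // row can match; a predicate on a non-indexed column must be skipped, not
>   // treated as a miss. Treating it as a miss is the bug shape described above.
>   static boolean mayContainMatches(FileStats stats, List<EqPredicate> conjuncts) {
>     for (EqPredicate p : conjuncts) {
>       if (!stats.min().containsKey(p.column())) {
>         continue; // column not indexed: stats are inconclusive, keep the file
>       }
>       if (p.value() < stats.min().get(p.column())
>           || p.value() > stats.max().get(p.column())) {
>         return false; // provably no matching row in this file
>       }
>     }
>     return true; // file remains a candidate
>   }
> }
> {code}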



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-02-07 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5557:
--
Priority: Critical  (was: Major)

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.1
>Reporter: ruofan
>Priority: Critical
>  Labels: pull-request-available
>
> Suppose a Hudi table has five fields, but only two of them are indexed. When 
> part of the filter condition in a SQL query refers to indexed fields and the 
> other part refers to non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, given the following Hudi table schema
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> the table properties
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> and the SQL query
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> with hoodie.enable.data.skipping=false the data can be found, but with 
> hoodie.enable.data.skipping=true the expected data is not returned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5716) Fix Partitioners to avoid assuming that parallelism is always present

2023-02-06 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5716:
-

 Summary: Fix Partitioners to avoid assuming that parallelism is 
always present
 Key: HUDI-5716
 URL: https://issues.apache.org/jira/browse/HUDI-5716
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.1


Currently, `Partitioner` impls assume that there's always going to be some 
parallelism level.

This has not been an issue previously for the following reasons:
 * RDDs always have an inherent "parallelism" level, defined as the # of partitions 
they operate on. However, for a Dataset (SparkPlan) that's not necessarily the 
case (some SparkPlans might not report their output partitioning)
 * Additionally, we previously had a default parallelism level set in our configs, 
which meant that we'd prefer it over the parallelism of the actual incoming dataset.

However, since we've recently removed the default parallelism value from our 
configs, we now need to fix the Partitioners to make sure they do not assume 
that a parallelism level is always present (see the sketch below).
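
A minimal sketch of the defensive resolution the fix should perform; the names are illustrative, not Hudi's actual Partitioner API:
{code:java}
// Illustrative helper, not Hudi's actual API: prefer an explicitly configured
// parallelism, fall back to the parallelism reported by the incoming data if
// any, and only then to a safe default, so that a SparkPlan reporting no
// output partitioning can no longer yield an absent parallelism level.
final class ParallelismSketch {
  static int resolveParallelism(Integer configured, Integer reportedByInput, int fallback) {
    if (configured != null && configured > 0) {
      return configured;
    }
    if (reportedByInput != null && reportedByInput > 0) {
      return reportedByInput;
    }
    return fallback;
  }
}
{code}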



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-4261.
-
Resolution: Fixed

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.2
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" into a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: since no re-partitioning is 
> done to align with the actual partition column, every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely hold a record from every physical partition 
> on disk (let's say we have M of these). B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce:
>  # M * N files in the table
>  # For every file being written, a "handle" that Hudi keeps in 
> memory, which in turn holds a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!
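> 
> Back-of-the-envelope numbers make the failure mode concrete (all figures below are assumed for illustration, none come from the report itself):
> {code:java}
> // All figures are hypothetical, chosen only to illustrate the arithmetic.
> public class OomEstimateSketch {
>   public static void main(String[] args) {
>     long n = 200;                  // logical partitions (shuffle parallelism)
>     long m = 1_500;                // physical partitions on disk
>     long bufferBytes = 4L << 20;   // assumed Parquet buffer per open handle (~4 MB)
> 
>     System.out.println("small files created: " + (n * m));  // 300,000
>     System.out.println("write buffers held per task: "
>         + (m * bufferBytes / (1L << 20)) + " MB");          // 6,000 MB in one task
>   }
> }
> {code}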



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684974#comment-17684974
 ] 

Alexey Kudinkin commented on HUDI-4261:
---

Unfortunately, OOM is a known side-effect of this specific partitioner.

To remediate OOMs, additional partitioners have been introduced in HUDI-5338.

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.2
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" into a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: since no re-partitioning is 
> done to align with the actual partition column, every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely hold a record from every physical partition 
> on disk (let's say we have M of these). B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce:
>  # M * N files in the table
>  # For every file being written, a "handle" that Hudi keeps in 
> memory, which in turn holds a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Fix Version/s: 0.12.2
   (was: 0.13.0)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.2
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" into a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: since no re-partitioning is 
> done to align with the actual partition column, every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely hold a record from every physical partition 
> on disk (let's say we have M of these). B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce:
>  # M * N files in the table
>  # For every file being written, a "handle" that Hudi keeps in 
> memory, which in turn holds a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-3883) Bulk-insert w/ sort-mode "NONE" leads to file-sizing issues

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-3883.
-
Fix Version/s: 0.12.2
   (was: 0.13.1)
   Resolution: Fixed

> Bulk-insert w/ sort-mode "NONE" leads to file-sizing issues
> ---
>
> Key: HUDI-3883
> URL: https://issues.apache.org/jira/browse/HUDI-3883
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.2
>
> Attachments: Screen Shot 2022-04-14 at 1.08.19 PM.png
>
>
> Even after HUDI-3709, I still see that when writing a partitioned table, 
> file sizing doesn't seem to be properly respected: in that case I was running an 
> ingestion job with the following configs, which were supposed to yield ~100 MB 
> files
> {code:java}
> Map(
>   "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024), // 100 MB
>   "hoodie.parquet.max.file.size"    -> String.valueOf(120 * 1024 * 1024)  // 120 MB
> ) {code}
>  
> Instead, my table contains a lot of very small (~1 MB) files: 
> !Screen Shot 2022-04-14 at 1.08.19 PM.png|width=742,height=422!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Component/s: writer-core

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" into a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: since no re-partitioning is 
> done to align with the actual partition column, every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely hold a record from every physical partition 
> on disk (let's say we have M of these). B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce:
>  # M * N files in the table
>  # For every file being written, a "handle" that Hudi keeps in 
> memory, which in turn holds a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3883) Bulk-insert w/ sort-mode "NONE" leads to file-sizing issues

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3883:
--
Fix Version/s: 0.13.1
   (was: 0.13.0)

> Bulk-insert w/ sort-mode "NONE" leads to file-sizing issues
> ---
>
> Key: HUDI-3883
> URL: https://issues.apache.org/jira/browse/HUDI-3883
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
> Attachments: Screen Shot 2022-04-14 at 1.08.19 PM.png
>
>
> Even after HUDI-3709, I still see that when writing a partitioned table, 
> file sizing doesn't seem to be properly respected: in that case I was running an 
> ingestion job with the following configs, which were supposed to yield ~100 MB 
> files
> {code:java}
> Map(
>   "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024), // 100 MB
>   "hoodie.parquet.max.file.size"    -> String.valueOf(120 * 1024 * 1024)  // 120 MB
> ) {code}
>  
> Instead, my table contains a lot of very small (~1 MB) files: 
> !Screen Shot 2022-04-14 at 1.08.19 PM.png|width=742,height=422!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5698) Evaluate whether Hudi should sync partitions to HMS

2023-02-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5698:
--
Component/s: meta-sync

> Evaluate whether Hudi should sync partitions to HMS
> ---
>
> Key: HUDI-5698
> URL: https://issues.apache.org/jira/browse/HUDI-5698
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> When syncing a Hudi table to HMS, we need to sync partition information by 
> calling `add_partitions`.  Such an operation is expensive (~40 s per 1,000 
> partitions).  We should check whether this can be improved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5698) Evaluate whether Hudi should sync partitions to HMS

2023-02-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5698:
--
Summary: Evaluate whether Hudi should sync partitions to HMS  (was: 
Optimize sync to HMS in Hudi)

> Evaluate whether Hudi should sync partitions to HMS
> ---
>
> Key: HUDI-5698
> URL: https://issues.apache.org/jira/browse/HUDI-5698
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> When syncing a Hudi table to HMS, we need to sync partition information by 
> calling `add_partitions`.  Such an operation is expensive (~40 s per 1,000 
> partitions).  We should check whether this can be improved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5697) Spark SQL re-lists Hudi table after every SQL operations

2023-02-03 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5697:
-

 Summary: Spark SQL re-lists Hudi table after every SQL operations
 Key: HUDI-5697
 URL: https://issues.apache.org/jira/browse/HUDI-5697
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark, spark-sql
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.1


Currently, after most DML operations in Spark SQL, Hudi invokes 
`Catalog.refreshTable`.

Prior to Spark 3.2, this was essentially doing the following:
 # Invalidating the relation cache (forcing the relation to be re-resolved next 
time, creating a new FileIndex, listing files, etc.)
 # Triggering cascading invalidation (re-caching) of the cached data (in 
CacheManager)

As of Spark 3.2, it additionally does `LogicalRelation.refresh` for ALL 
tables (previously this was only done for Temporary Views), therefore entailing 
the whole table being re-listed again by triggering `FileIndex.refresh`, which 
might be a costly operation.

We should revert to the preceding behavior from Spark 3.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5691) Fix HoodiePruneFileSourcePartition missing to list non-partitioned tables

2023-02-02 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5691:
-

 Summary: Fix HoodiePruneFileSourcePartition missing to list 
non-partitioned tables
 Key: HUDI-5691
 URL: https://issues.apache.org/jira/browse/HUDI-5691
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


This results in these tables being incorrectly interpreted by Spark as empty (0 
bytes) in its CBO analysis.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5685:
-

 Summary: Fix performance gap in Bulk Insert row-writing path with 
enabled de-duplication
 Key: HUDI-5685
 URL: https://issues.apache.org/jira/browse/HUDI-5685
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


Currently, in case the flag {{hoodie.combine.before.insert}} is set to true and 
{{hoodie.bulkinsert.sort.mode}} is set to {{{}NONE{}}}, Bulk Insert row-writing 
performance will degrade considerably due to the following circumstances:
 * During de-duplication (w/in {{{}dedupRows{}}}) records in the incoming RDD 
would be reshuffled (by Spark's default {{{}HashPartitioner{}}}) based on 
{{(partition-path, record-key)}} into N partitions
 * In case {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
re-partitioning will be performed, and therefore each Spark task might be 
writing into M table partitions
 * This in turn entails an explosion in the number of (small) files created, 
killing performance and the table's layout (see the sketch below)
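
A hedged sketch of one possible remediation (illustrative, not necessarily the shipped fix): re-partition the de-duplicated rows by partition path alone before handing them to the writer, so that each Spark task again writes into only a few table partitions. The column name and placement are assumptions:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Illustrative only. The de-duplication shuffle hashes on
// (partition-path, record-key), scattering each table partition across all N
// tasks; re-partitioning on the partition path alone groups rows back so a
// task writes into few table partitions.
final class CoLocateAfterDedupSketch {
  static Dataset<Row> coLocate(Dataset<Row> dedupedRows, String partitionPathColumn, int parallelism) {
    return dedupedRows.repartition(parallelism, col(partitionPathColumn));
  }
}
{code}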



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5685) Fix performance gap in Bulk Insert row-writing path with enabled de-duplication

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5685:
--
Sprint: Sprint 2023-01-31

> Fix performance gap in Bulk Insert row-writing path with enabled 
> de-duplication
> ---
>
> Key: HUDI-5685
> URL: https://issues.apache.org/jira/browse/HUDI-5685
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, in case the flag {{hoodie.combine.before.insert}} is set to true and 
> {{hoodie.bulkinsert.sort.mode}} is set to {{{}NONE{}}}, Bulk Insert row-writing 
> performance will degrade considerably due to the following 
> circumstances:
>  * During de-duplication (w/in {{{}dedupRows{}}}) records in the incoming RDD 
> would be reshuffled (by Spark's default {{{}HashPartitioner{}}}) based on 
> {{(partition-path, record-key)}} into N partitions
>  * In case {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
> re-partitioning will be performed, and therefore each Spark task might be 
> writing into M table partitions
>  * This in turn entails an explosion in the number of (small) files created, 
> killing performance and the table's layout



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Status: Patch Available  (was: In Progress)

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target table 
> has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden 
> by user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Status: In Progress  (was: Open)

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target table 
> has a pre-combine key specified.
> However, it's done in a way that doesn't allow it to be overridden 
> by user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5681.
-
Resolution: Fixed

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon the 
> following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   

[jira] [Updated] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5684:
--
Sprint: Sprint 2023-01-31

> Fix CTAS to make combine-on-insert configurable
> ---
>
> Key: HUDI-5684
> URL: https://issues.apache.org/jira/browse/HUDI-5684
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target
> table has a pre-combine key specified.
> However, it's currently done in a way that doesn't allow it to be overridden
> by user-provided configuration. We need to address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5684) Fix CTAS to make combine-on-insert configurable

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5684:
-

 Summary: Fix CTAS to make combine-on-insert configurable
 Key: HUDI-5684
 URL: https://issues.apache.org/jira/browse/HUDI-5684
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


Currently, CTAS sets the `COMBINE_ON_INSERT` config value whenever the target
table has a pre-combine key specified.

However, it's currently done in a way that doesn't allow it to be overridden by
user-provided configuration. We need to address that.
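
To illustrate the intended precedence, here is a minimal sketch (the config key
`hoodie.combine.before.insert` and the option-merging shape are assumptions for
illustration, not the actual CTAS code path): the CTAS-injected value should
act only as a default that user-provided options can override.
{code:scala}
// Minimal sketch, assuming the combine flag is keyed by
// "hoodie.combine.before.insert" (verify against your Hudi version).
val ctasDefaults = Map("hoodie.combine.before.insert" -> "true")  // injected when a pre-combine key exists
val userOptions  = Map("hoodie.combine.before.insert" -> "false") // e.g. supplied explicitly by the user
// Apply defaults first and user options last, so the user's value wins:
val effectiveConfig = ctasDefaults ++ userOptions
assert(effectiveConfig("hoodie.combine.before.insert") == "false")
{code}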



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Status: In Progress  (was: Open)

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon
> the following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> org.apache.spark.sql.hu

[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Status: Patch Available  (was: In Progress)

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon
> the following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> org.apache.s

[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3  
(was: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 2022/12/12, 
0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, Sprint 
2023-01-31)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set
> up not to reuse the actual LogScanner and HFileReader used to read the MT
> itself.
> This has already proven wasteful on a number of occasions, including (not an
> exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Status: Patch Available  (was: In Progress)

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
> forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>   val sortModeName = sortMode.name()
>   test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>     withTempDir { basePath =>
>       testBulkInsertPartitioner(basePath, sortModeName)
>     }
>   }
> }
>
> def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>   val tableName = generateTableName
>   // Remove these with [HUDI-5419]
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>   spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>   // Default parallelism is 200, which means in global sort each record will
>   // end up in a different spark partition, so 9 files would be created.
>   // Setting parallelism to 3 so that each spark partition will contain a
>   // hudi partition.
>   val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>     "hoodie.bulkinsert.shuffle.parallelism = 3,"
>   } else {
>     ""
>   }
>   spark.sql(
>     s"""
>        |create table $tableName (
>        |  id int,
>        |  name string,
>        |  price double,
>        |  dt string
>        |) using hudi
>        | tblproperties (
>        |  primaryKey = 'id',
>        |  preCombineField = 'name',
>        |  type = 'cow',
>        |  $parallelism
>        |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>        | )
>        | partitioned by (dt)
>        | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>   spark.sql("set hoodie.sql.bulk.insert.enable = true")
>   spark.sql("set hoodie.sql.insert.mode = non-strict")
>   spark.sql(
>     s"""insert into $tableName values
>        |(5, 'a', 35, '2021-05-21'),
>        |(1, 'a', 31, '2021-01-21'),
>        |(3, 'a', 33, '2021-03-21'),
>        |(4, 'b', 16, '2021-05-21'),
>        |(2, 'b', 18, '2021-01-21'),
>        |(6, 'b', 17, '2021-03-21'),
>        |(8, 'a', 21, '2021-05-21'),
>        |(9, 'a', 22, '2021-01-21'),
>        |(7, 'a', 23, '2021-03-21')
>        |""".stripMargin)
>   assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
> } {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apac
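
For intuition, below is a minimal sketch of the invariant the fix needs to
restore (the method shape and fallback are assumptions, not the actual Hudi
implementation): a deduced shuffle parallelism must never be zero.
{code:scala}
// Hypothetical sketch, not Hudi's actual deduceShuffleParallelism: prefer the
// configured value; otherwise fall back to the incoming Dataset's partition
// count, clamped to at least 1 so Repartition's "must be positive" check
// cannot trip on empty inputs.
import org.apache.spark.sql.DataFrame

def deduceShuffleParallelism(rows: DataFrame, configuredParallelism: Int): Int =
  if (configuredParallelism > 0) configuredParallelism
  else math.max(rows.rdd.getNumPartitions, 1)
{code}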

[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 2022/11/29, 
2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3, 
Sprint 2023-02-14  (was: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15, 
2022/11/29, 2022/12/12, 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 
Final Sprint 3)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set
> up not to reuse the actual LogScanner and HFileReader used to read the MT
> itself.
> This has already proven wasteful on a number of occasions, including (not an
> exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5633.
-
Resolution: Fixed

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> There are currently the following issues w/ the HoodieSparkRecord
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`,
> which do a schema traversal for every record. Instead, we should traverse the
> schema only once and produce a transformer that directly creates the new
> record from the old one (see the sketch below).
>  # Records are currently copied for every Executor, even for the Simple one,
> which doesn't actually buffer any records and therefore doesn't require
> records to be copied.
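
To make the first point concrete, here is a minimal sketch of the "traverse
once, transform many" idea (the `Schema`/`Record` types are simplified
stand-ins, not Hudi's Avro schemas or `InternalRow`):
{code:scala}
// Simplified stand-in types for illustration only.
case class Schema(fields: Seq[String])
type Record = Array[Any]

// One-time schema traversal compiles an index plan; per-record work is then a
// plain positional copy with no schema walking.
def buildTransformer(from: Schema, to: Schema): Record => Record = {
  val plan: Array[Int] = to.fields.map(f => from.fields.indexOf(f)).toArray
  (rec: Record) => plan.map(i => if (i >= 0) rec(i) else null)
}

val transform = buildTransformer(Schema(Seq("id", "name", "price")), Schema(Seq("name", "id")))
assert(transform(Array(1, "a", 35.0)).sameElements(Array("a", 1)))
{code}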



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Status: In Progress  (was: Open)

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
> forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>   val sortModeName = sortMode.name()
>   test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>     withTempDir { basePath =>
>       testBulkInsertPartitioner(basePath, sortModeName)
>     }
>   }
> }
>
> def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>   val tableName = generateTableName
>   // Remove these with [HUDI-5419]
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>   spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>   // Default parallelism is 200, which means in global sort each record will
>   // end up in a different spark partition, so 9 files would be created.
>   // Setting parallelism to 3 so that each spark partition will contain a
>   // hudi partition.
>   val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>     "hoodie.bulkinsert.shuffle.parallelism = 3,"
>   } else {
>     ""
>   }
>   spark.sql(
>     s"""
>        |create table $tableName (
>        |  id int,
>        |  name string,
>        |  price double,
>        |  dt string
>        |) using hudi
>        | tblproperties (
>        |  primaryKey = 'id',
>        |  preCombineField = 'name',
>        |  type = 'cow',
>        |  $parallelism
>        |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>        | )
>        | partitioned by (dt)
>        | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>   spark.sql("set hoodie.sql.bulk.insert.enable = true")
>   spark.sql("set hoodie.sql.insert.mode = non-strict")
>   spark.sql(
>     s"""insert into $tableName values
>        |(5, 'a', 35, '2021-05-21'),
>        |(1, 'a', 31, '2021-01-21'),
>        |(3, 'a', 33, '2021-03-21'),
>        |(4, 'b', 16, '2021-05-21'),
>        |(2, 'b', 18, '2021-01-21'),
>        |(6, 'b', 17, '2021-03-21'),
>        |(8, 'a', 21, '2021-05-21'),
>        |(9, 'a', 22, '2021-01-21'),
>        |(7, 'a', 23, '2021-03-21')
>        |""".stripMargin)
>   assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
> } {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sq

[jira] [Updated] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5681:
--
Sprint: Sprint 2023-01-31

> Merge Into fails while deserializing expressions
> 
>
> Key: HUDI-5681
> URL: https://issues.apache.org/jira/browse/HUDI-5681
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While running our benchmark suite against the 0.13 RC, we've stumbled upon
> the following exceptions:
> {code:java}
> 23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
> aborting job
> 2023-02-01T08:29:01.219 ERROR: merge:1:inventory
> Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
> recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
> (ip-172-31-18-9.us-west-2.compute.internal executor 140): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :1
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:138)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.spark.sql.catalyst.expressions.Literal
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
>   at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
>   at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
>   at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
>   at 
> org.apache.spark.sql.hudi.comm

[jira] [Created] (HUDI-5681) Merge Into fails while deserializing expressions

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5681:
-

 Summary: Merge Into fails while deserializing expressions
 Key: HUDI-5681
 URL: https://issues.apache.org/jira/browse/HUDI-5681
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


While running our benchmark suite against the 0.13 RC, we've stumbled upon the
following exceptions:
{code:java}
23/02/01 08:29:01 ERROR TaskSetManager: Task 1 in stage 947.0 failed 4 times; 
aborting job
2023-02-01T08:29:01.219 ERROR: merge:1:inventory
Job aborted due to stage failure: Task 1 in stage 947.0 failed 4 times, most 
recent failure: Lost task 1.3 in stage 947.0 (TID 101955) 
(ip-172-31-18-9.us-west-2.compute.internal executor 140): 
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :1
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:342)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:253)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
org.apache.spark.sql.catalyst.expressions.Literal
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:221)
at com.twitter.chill.Tuple10Serializer.read(TupleSerializers.scala:199)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:408)
at org.apache.spark.sql.hudi.SerDeUtils$.toObject(SerDeUtils.scala:42)
at 
org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:423)
at 
org.apache.spark.sql.hudi.command.payload.ExpressionPayload$$anon$7.apply(ExpressionPayload.scala:419)
at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2405)
at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
at 
com.
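
The failure pattern here is Kryo resolving classes through a ClassLoader that
cannot see Spark's Catalyst classes on the executor. A minimal sketch of
deserializing with an explicit loader (illustrative only, not Hudi's actual
fix):
{code:scala}
// Hypothetical sketch: hand the deserializer the ClassLoader that actually has
// Catalyst on its classpath, instead of whatever loader Kryo defaults to.
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.SparkEnv

def deserializeWithLoader[T: ClassTag](bytes: Array[Byte]): T = {
  val serde = SparkEnv.get.serializer.newInstance()
  serde.deserialize[T](ByteBuffer.wrap(bytes),
    Thread.currentThread().getContextClassLoader)
}
{code}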

[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Sprint: Sprint 2023-01-31

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
> forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>   val sortModeName = sortMode.name()
>   test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>     withTempDir { basePath =>
>       testBulkInsertPartitioner(basePath, sortModeName)
>     }
>   }
> }
>
> def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>   val tableName = generateTableName
>   // Remove these with [HUDI-5419]
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>   spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>   // Default parallelism is 200, which means in global sort each record will
>   // end up in a different spark partition, so 9 files would be created.
>   // Setting parallelism to 3 so that each spark partition will contain a
>   // hudi partition.
>   val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>     "hoodie.bulkinsert.shuffle.parallelism = 3,"
>   } else {
>     ""
>   }
>   spark.sql(
>     s"""
>        |create table $tableName (
>        |  id int,
>        |  name string,
>        |  price double,
>        |  dt string
>        |) using hudi
>        | tblproperties (
>        |  primaryKey = 'id',
>        |  preCombineField = 'name',
>        |  type = 'cow',
>        |  $parallelism
>        |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>        | )
>        | partitioned by (dt)
>        | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>   spark.sql("set hoodie.sql.bulk.insert.enable = true")
>   spark.sql("set hoodie.sql.insert.mode = non-strict")
>   spark.sql(
>     s"""insert into $tableName values
>        |(5, 'a', 35, '2021-05-21'),
>        |(1, 'a', 31, '2021-01-21'),
>        |(3, 'a', 33, '2021-03-21'),
>        |(4, 'b', 16, '2021-05-21'),
>        |(2, 'b', 18, '2021-01-21'),
>        |(6, 'b', 17, '2021-03-21'),
>        |(8, 'a', 21, '2021-05-21'),
>        |(9, 'a', 22, '2021-01-21'),
>        |(7, 'a', 23, '2021-03-21')
>        |""".stripMargin)
>   assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
> } {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Datas

[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Fix Version/s: 0.13.0

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image (1).png
>
>
> This test 
> {code:java}
> forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>   val sortModeName = sortMode.name()
>   test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>     withTempDir { basePath =>
>       testBulkInsertPartitioner(basePath, sortModeName)
>     }
>   }
> }
>
> def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>   val tableName = generateTableName
>   // Remove these with [HUDI-5419]
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>   spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>   // Default parallelism is 200, which means in global sort each record will
>   // end up in a different spark partition, so 9 files would be created.
>   // Setting parallelism to 3 so that each spark partition will contain a
>   // hudi partition.
>   val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>     "hoodie.bulkinsert.shuffle.parallelism = 3,"
>   } else {
>     ""
>   }
>   spark.sql(
>     s"""
>        |create table $tableName (
>        |  id int,
>        |  name string,
>        |  price double,
>        |  dt string
>        |) using hudi
>        | tblproperties (
>        |  primaryKey = 'id',
>        |  preCombineField = 'name',
>        |  type = 'cow',
>        |  $parallelism
>        |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>        | )
>        | partitioned by (dt)
>        | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>   spark.sql("set hoodie.sql.bulk.insert.enable = true")
>   spark.sql("set hoodie.sql.insert.mode = non-strict")
>   spark.sql(
>     s"""insert into $tableName values
>        |(5, 'a', 35, '2021-05-21'),
>        |(1, 'a', 31, '2021-01-21'),
>        |(3, 'a', 33, '2021-03-21'),
>        |(4, 'b', 16, '2021-05-21'),
>        |(2, 'b', 18, '2021-01-21'),
>        |(6, 'b', 17, '2021-03-21'),
>        |(8, 'a', 21, '2021-05-21'),
>        |(9, 'a', 22, '2021-01-21'),
>        |(7, 'a', 23, '2021-03-21')
>        |""".stripMargin)
>   assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
> } {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$

[jira] [Updated] (HUDI-5678) deduceShuffleParallelism Returns 0 when that should never happen

2023-02-01 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5678:
--
Priority: Blocker  (was: Major)

> deduceShuffleParallelism Returns 0 when that should never happen
> 
>
> Key: HUDI-5678
> URL: https://issues.apache.org/jira/browse/HUDI-5678
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: image (1).png
>
>
> This test 
> {code:java}
> forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) =>
>   val sortModeName = sortMode.name()
>   test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {
>     withTempDir { basePath =>
>       testBulkInsertPartitioner(basePath, sortModeName)
>     }
>   }
> }
>
> def testBulkInsertPartitioner(basePath: File, sortModeName: String): Unit = {
>   val tableName = generateTableName
>   // Remove these with [HUDI-5419]
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.insert.drop.duplicates")
>   spark.sessionState.conf.unsetConf("hoodie.merge.allow.duplicate.on.inserts")
>   spark.sessionState.conf.unsetConf("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
>   // Default parallelism is 200, which means in global sort each record will
>   // end up in a different spark partition, so 9 files would be created.
>   // Setting parallelism to 3 so that each spark partition will contain a
>   // hudi partition.
>   val parallelism = if (sortModeName.equals(BulkInsertSortMode.GLOBAL_SORT.name())) {
>     "hoodie.bulkinsert.shuffle.parallelism = 3,"
>   } else {
>     ""
>   }
>   spark.sql(
>     s"""
>        |create table $tableName (
>        |  id int,
>        |  name string,
>        |  price double,
>        |  dt string
>        |) using hudi
>        | tblproperties (
>        |  primaryKey = 'id',
>        |  preCombineField = 'name',
>        |  type = 'cow',
>        |  $parallelism
>        |  hoodie.bulkinsert.sort.mode = '$sortModeName'
>        | )
>        | partitioned by (dt)
>        | location '${basePath.getCanonicalPath}/$tableName'""".stripMargin)
>   spark.sql("set hoodie.sql.bulk.insert.enable = true")
>   spark.sql("set hoodie.sql.insert.mode = non-strict")
>   spark.sql(
>     s"""insert into $tableName values
>        |(5, 'a', 35, '2021-05-21'),
>        |(1, 'a', 31, '2021-01-21'),
>        |(3, 'a', 33, '2021-03-21'),
>        |(4, 'b', 16, '2021-05-21'),
>        |(2, 'b', 18, '2021-01-21'),
>        |(6, 'b', 17, '2021-03-21'),
>        |(8, 'a', 21, '2021-05-21'),
>        |(9, 'a', 22, '2021-01-21'),
>        |(7, 'a', 23, '2021-03-21')
>        |""".stripMargin)
>   assertResult(3)(spark.sql(s"select distinct _hoodie_file_name from $tableName").count())
> } {code}
> Fails due to 
> {code:java}
> requirement failed: Number of partitions (0) must be positive.
> java.lang.IllegalArgumentException: requirement failed: Number of partitions 
> (0) must be positive.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:951)
>   at org.apache.spark.sql.Dataset.coalesce(Dataset.scala:2946)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:48)
>   at 
> org.apache.hudi.execution.bulkinsert.PartitionSortPartitionerWithRows.repartitionRecords(PartitionSortPartitionerWithRows.java:34)
>   at 
> org.apache.hudi.HoodieDatasetBulkInsertHelper$.prepareForBulkInsert(HoodieDatasetBulkInsertHelper.scala:124)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:763)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:239)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:107)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Datas

[jira] [Created] (HUDI-5679) Deduplication in row-writing bulk-insert ends up w/ OOMs

2023-02-01 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5679:
-

 Summary: Deduplication in row-writing bulk-insert ends up w/ OOMs
 Key: HUDI-5679
 URL: https://issues.apache.org/jira/browse/HUDI-5679
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark, spark-sql
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


I'm running a small-scale (1 GB) benchmark on a fat cluster (50 GB of memory)
and am still somehow getting OOMs when deduplication is enabled in the
row-writing bulk-insert path in CTAS.
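
For context, a memory-friendlier shape for key-based deduplication is a
pairwise reduce rather than buffering whole key groups. A minimal sketch
(column names and the latest-wins rule are assumptions, not the Hudi code path
under investigation):
{code:scala}
// Hypothetical sketch: keep one row per key via reduceByKey, which merges rows
// pairwise during the shuffle instead of materializing whole groups in memory.
import org.apache.spark.sql.DataFrame

def dedupByKey(df: DataFrame, keyCol: String, orderingCol: String): DataFrame = {
  val deduped = df.rdd
    .keyBy(_.getAs[Any](keyCol))
    .reduceByKey { (a, b) =>
      // latest-wins on the ordering column (e.g. the pre-combine field)
      if (a.getAs[Comparable[Any]](orderingCol).compareTo(b.getAs[Any](orderingCol)) >= 0) a else b
    }
    .values
  df.sparkSession.createDataFrame(deduped, df.schema)
}
{code}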



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5670) Server-based markers creation times out

2023-01-31 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5670:
--
Summary: Server-based markers creation times out  (was: Server-based Marker 
creation times out)

> Server-based markers creation times out
> ---
>
> Key: HUDI-5670
> URL: https://issues.apache.org/jira/browse/HUDI-5670
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Running write benchmarks w/ SparkRecordMerger enabled, we're hitting this
> SocketTimeoutException when trying to create markers:
> {code:java}
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>         at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1535)
>         at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445)
>         at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1509)
>         at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1332)
>         at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F11%2F29/8e3045e1-6de0-492e-bc34-85e2b8502767-0_1207-352-97656_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>         ... 24 more
>
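
As a stop-gap while the timeout itself is investigated, writers can typically
side-step the timeline server for markers. A minimal sketch (the config key
`hoodie.write.markers.type` is as documented for Hudi 0.13 but should be
verified for your version; `df` and the path are placeholders):
{code:scala}
// Hypothetical mitigation sketch, not the fix for this ticket: use direct
// markers so marker creation doesn't go through the timeline server's HTTP
// endpoint. `df` is an existing DataFrame; the path is a placeholder.
val basePath = "/tmp/hudi/benchmark_table"
df.write.format("hudi")
  .option("hoodie.table.name", "benchmark_table")
  .option("hoodie.write.markers.type", "DIRECT") // instead of TIMELINE_SERVER_BASED
  .mode("append")
  .save(basePath)
{code}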

[jira] [Updated] (HUDI-5670) Server-based Marker creation times out

2023-01-31 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5670:
--
Fix Version/s: 0.13.1

> Server-based Marker creation times out
> --
>
> Key: HUDI-5670
> URL: https://issues.apache.org/jira/browse/HUDI-5670
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Running write benchmarks w/ SparkRecordMerger enabled, we're hitting this
> SocketTimeoutException when trying to create markers:
> {code:java}
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>         at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1535)
>         at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445)
>         at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1509)
>         at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1332)
>         at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F11%2F29/8e3045e1-6de0-492e-bc34-85e2b8502767-0_1207-352-97656_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>         ... 24 more
> Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create

[jira] [Created] (HUDI-5670) Server-based Marker creation times out

2023-01-31 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5670:
-

 Summary: Server-based Marker creation times out
 Key: HUDI-5670
 URL: https://issues.apache.org/jira/browse/HUDI-5670
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: Alexey Kudinkin
Assignee: Ethan Guo


Running write benchmarks w/ SparkRecordMerger enabled, we hit this 
SocketTimeoutException while trying to create markers:
{code:java}
org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
Read timed out
        at 
org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
        at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
        at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
        at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1535)
        at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445)
        at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1509)
        at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1332)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
Read timed out
        at 
org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
        at 
org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
        at 
org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
        ... 22 more
Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
Read timed out
        at 
org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
        at 
org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
        at 
org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
        ... 22 more
Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
partition=2020%2F11%2F29/8e3045e1-6de0-492e-bc34-85e2b8502767-0_1207-352-97656_20230201020238055.parquet.marker.CREATE
Read timed out
        at 
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
        at 
org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
        ... 24 more
Caused by: org.apache.hudi.exception.HoodieRemoteException: Failed to create 
marker file 
partition=2020%2F11%2F29/8e3045e1-6de0-492e-bc34-85e2b8502767-0_1207-352-97656_20230201020238055.parquet.marker.CREATE
Read timed out
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.executeCreateMarkerRequest(TimelineServerBasedWriteMarkers.java:186)
        at 
org.apache.hudi.table.marker.TimelineServerBasedWriteMarkers.create(TimelineServerBasedWriteMarker
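
While the root cause is being investigated, a common mitigation is to bypass the 
timeline server for marker creation altogether. Below is a minimal, hedged sketch 
assuming the standard Spark datasource write path and the 
`hoodie.write.markers.type` config; the table name and base path are illustrative 
placeholders, not values from this ticket:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Sketch only: direct (filesystem-based) markers avoid the HTTP round-trip to
// the embedded timeline server that is timing out in the trace above.
class DirectMarkersWriteSketch {
  static void write(Dataset<Row> df) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "benchmark_table")   // illustrative
        .option("hoodie.write.markers.type", "DIRECT")    // default: TIMELINE_SERVER_BASED
        .mode(SaveMode.Append)
        .save("/tmp/hudi/benchmark_table");               // illustrative
  }
}
{code}

The trade-off is that direct markers issue one filesystem call per marker, which 
is cheap on HDFS-like storage but is exactly the pressure timeline-server-based 
markers were introduced to batch on object stores.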

[jira] [Updated] (HUDI-5656) Metadata Bootstrap flow resulting in NPE

2023-01-30 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5656:
--
Sprint: 0.13.0 Final Sprint 3

> Metadata Bootstrap flow resulting in NPE
> 
>
> Key: HUDI-5656
> URL: https://issues.apache.org/jira/browse/HUDI-5656
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> After adding a simple statement forcing the test to read the whole bootstrapped 
> table:
> {code:java}
> sqlContext.sql("select * from bootstrapped").show(); {code}
>  
> The following NPE has been observed on master 
> (testBulkInsertsAndUpsertsWithBootstrap):
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 183.0 failed 1 times, most recent failure: Lost task 0.0 in stage 183.0 
> (TID 971, localhost, executor driver): java.lang.NullPointerException
>     at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)Driver stacktrace:    at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
>     at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>     at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3389)
>     at or

[jira] [Created] (HUDI-5656) Metadata Bootstrap flow resulting in NPE

2023-01-30 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5656:
-

 Summary: Metadata Bootstrap flow resulting in NPE
 Key: HUDI-5656
 URL: https://issues.apache.org/jira/browse/HUDI-5656
 Project: Apache Hudi
  Issue Type: Bug
  Components: bootstrap
Affects Versions: 0.13.0
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin


After adding a simple statement forcing the test to read the whole bootstrapped 
table:
{code:java}
sqlContext.sql("select * from bootstrapped").show(); {code}
 

The following NPE has been observed on master 
(testBulkInsertsAndUpsertsWithBootstrap):
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 183.0 failed 1 times, most recent failure: Lost task 0.0 in stage 183.0 
(TID 971, localhost, executor driver): java.lang.NullPointerException
    at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)Driver stacktrace:    at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
    at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3389)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
    at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at 
org.apache.spark.sql.execution.SQLExecutio

[jira] [Created] (HUDI-5653) Troubleshoot flaky TestHoodieDeltaStreamerWithMultiWriter

2023-01-30 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5653:
-

 Summary: Troubleshoot flaky TestHoodieDeltaStreamerWithMultiWriter
 Key: HUDI-5653
 URL: https://issues.apache.org/jira/browse/HUDI-5653
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: sivabalan narayanan


Let's follow up w/ TestHoodieDeltaStreamerWithMultiWriter and make sure we 
rework it in a way that allows us to run it reliably.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5641) Streamline Advanced Schema Evolution flow

2023-01-28 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5641:
-

 Summary: Streamline Advanced Schema Evolution flow
 Key: HUDI-5641
 URL: https://issues.apache.org/jira/browse/HUDI-5641
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.1


Currently, Schema Evolution is not always applied consistently and is sometimes 
re-applied multiple times, causing issues for the HoodieSparkRecord implementation 
(which is optimized to reuse its underlying buffer):
 # HoodieMergeHelper would apply the SE transformer, then
 # HoodieMergeHandle would run rewriteRecordWithNewSchema again (see the sketch below)
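
To make the hazard concrete, here is a minimal, self-contained illustration; the 
class below is a stand-in, not Hudi's actual record type. A record view backed by 
a reused buffer must have a given schema projection applied exactly once; applying 
it twice composes the projection with itself:

{code:java}
// Illustrative only: a record backed by a shared, reused buffer.
final class BufferBackedRecord {
  private final Object[] buffer;

  BufferBackedRecord(Object[] buffer) {
    this.buffer = buffer;
  }

  // Re-projects fields in place from the writer layout to the reader layout.
  void rewriteInPlace(int[] projection) {
    Object[] old = buffer.clone();
    for (int i = 0; i < projection.length; i++) {
      buffer[i] = old[projection[i]];
    }
  }
}

// Applying the projection once is correct; applying it a second time (as
// HoodieMergeHelper followed by HoodieMergeHandle effectively does) yields
// buffer[i] = original[projection[projection[i]]] -- a scrambled field layout.
{code}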



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-01-27 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5633:
--
Fix Version/s: 0.13.0

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> There are currently the following issues w/ the HoodieSparkRecord 
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
> which do schema traversals for every record. Instead, we should traverse the 
> schema only once and produce a transformer that directly creates the new 
> record from the old one.
>  # Records are currently copied for every Executor, even the Simple one, which 
> doesn't actually buffer any records and therefore doesn't require records 
> to be copied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-01-27 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5633:
--
Status: In Progress  (was: Open)

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> There are currently the following issues w/ the HoodieSparkRecord 
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
> which do schema traversals for every record. Instead, we should traverse the 
> schema only once and produce a transformer that directly creates the new 
> record from the old one.
>  # Records are currently copied for every Executor, even the Simple one, which 
> doesn't actually buffer any records and therefore doesn't require records 
> to be copied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-01-27 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5633:
-

 Summary: Fixing HoodieSparkRecord performance bottlenecks
 Key: HUDI-5633
 URL: https://issues.apache.org/jira/browse/HUDI-5633
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin


There are currently the following issues w/ the HoodieSparkRecord 
implementation (see the sketch below):
 # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
which do schema traversals for every record. Instead, we should traverse the 
schema only once and produce a transformer that directly creates the new 
record from the old one.
 # Records are currently copied for every Executor, even the Simple one, which 
doesn't actually buffer any records and therefore doesn't require records to 
be copied.
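
As an illustration of point 1, here is a hedged sketch (names and types are made 
up, not Hudi's API): the schema pair is inspected once to build a positional 
projection, which is cached and then applied per record with plain array indexing:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Illustrative transformer cache: the schema traversal (here, matching field
// names positionally) runs once per schema pair; records are then transformed
// without re-inspecting the schema.
final class RecordTransformers {
  private static final Map<String, UnaryOperator<Object[]>> CACHE = new ConcurrentHashMap<>();

  static UnaryOperator<Object[]> forSchemas(List<String> writerFields, List<String> readerFields) {
    String key = writerFields + "->" + readerFields;
    return CACHE.computeIfAbsent(key, k -> {
      int[] projection = new int[readerFields.size()];
      for (int i = 0; i < readerFields.size(); i++) {
        projection[i] = writerFields.indexOf(readerFields.get(i)); // traversal happens once
      }
      return row -> {
        Object[] out = new Object[projection.length];
        for (int i = 0; i < projection.length; i++) {
          out[i] = projection[i] >= 0 ? row[projection[i]] : null;
        }
        return out;
      };
    });
  }
}
{code}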



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-01-27 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5633:
--
Sprint: 0.13.0 Final Sprint 3

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> There are currently the following issues w/ the HoodieSparkRecord 
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
> which do schema traversals for every record. Instead, we should traverse the 
> schema only once and produce a transformer that directly creates the new 
> record from the old one.
>  # Records are currently copied for every Executor, even the Simple one, which 
> doesn't actually buffer any records and therefore doesn't require records 
> to be copied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5633) Fixing HoodieSparkRecord performance bottlenecks

2023-01-27 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5633:
--
Status: Patch Available  (was: In Progress)

> Fixing HoodieSparkRecord performance bottlenecks
> 
>
> Key: HUDI-5633
> URL: https://issues.apache.org/jira/browse/HUDI-5633
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> There are currently the following issues w/ the HoodieSparkRecord 
> implementation:
>  # It rewrites records using `rewriteRecord` and `rewriteRecordWithNewSchema`, 
> which do schema traversals for every record. Instead, we should traverse the 
> schema only once and produce a transformer that directly creates the new 
> record from the old one.
>  # Records are currently copied for every Executor, even the Simple one, which 
> doesn't actually buffer any records and therefore doesn't require records 
> to be copied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5534) Optimize Bloom Index lookup DAG

2023-01-27 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5534.
-
Resolution: Fixed

> Optimize Bloom Index lookup DAG
> ---
>
> Key: HUDI-5534
> URL: https://issues.apache.org/jira/browse/HUDI-5534
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> There are some low-hanging performance optimizations that could considerably 
> improve the performance of the Bloom Index lookup sequence (see the sketch 
> below):
>  # Map file-comparison pairs to a PairRDD (where the key is the file name and 
> the value is the record key) instead of an RDD; this would allow us to 
>  ## Sort by file name (to make sure we check all records w/in the file 
> all at once) w/in a single Spark partition instead of globally (reducing 
> shuffling as well)
>  ## Avoid re-shuffling (by re-mapping from an RDD to a PairRDD later)
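
A hedged sketch of the keying idea, assuming the Spark Java API (illustrative, 
not the code that was merged): keeping the comparison pairs keyed by file name 
lets `repartitionAndSortWithinPartitions` do the per-file grouping inside each 
partition, with no global sort and no later re-keying shuffle:

{code:java}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

final class BloomIndexLookupSketch {
  // Keys each (fileName, recordKey) comparison pair by file name, then sorts
  // within partitions so all keys probing the same file are checked together,
  // without a global sort or an extra re-shuffle to re-key later.
  static JavaPairRDD<String, String> groupComparisonsByFile(
      JavaRDD<Tuple2<String, String>> comparisonPairs, int numPartitions) {
    return comparisonPairs
        .mapToPair(pair -> pair) // already (fileName, recordKey)
        .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions));
  }
}
{code}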



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5622) Refactor Write Executors tests to avoid code duplication

2023-01-25 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5622:
--
Priority: Minor  (was: Major)

> Refactor Write Executors tests to avoid code duplication
> 
>
> Key: HUDI-5622
> URL: https://issues.apache.org/jira/browse/HUDI-5622
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Alexey Kudinkin
>Priority: Minor
>
> Currently, tests for various executors are simply a duplication of each 
> other. Instead, we should refactor this to share the common logic (which is 
> most of it).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5622) Refactor Write Executors tests to avoid code duplication

2023-01-25 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5622:
-

 Summary: Refactor Write Executors tests to avoid code duplication
 Key: HUDI-5622
 URL: https://issues.apache.org/jira/browse/HUDI-5622
 Project: Apache Hudi
  Issue Type: Test
Reporter: Alexey Kudinkin


Currently, tests for various executors are simply a duplication of each other. 
Instead, we should refactor this to share the common logic (which is most of it).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5622) Refactor Write Executors tests to avoid code duplication

2023-01-25 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5622:
--
Labels: newbie  (was: )

> Refactor Write Executors tests to avoid code duplication
> 
>
> Key: HUDI-5622
> URL: https://issues.apache.org/jira/browse/HUDI-5622
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Alexey Kudinkin
>Priority: Minor
>  Labels: newbie
>
> Currently, tests for various executors are simply a duplication of each 
> other. Instead, we should refactor this to share the common logic (which is 
> most of it).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5363) Remove default parallelism values for all ops

2023-01-25 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5363.
-
Resolution: Fixed

> Remove default parallelism values for all ops
> -
>
> Key: HUDI-5363
> URL: https://issues.apache.org/jira/browse/HUDI-5363
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, we always override the parallelism of the incoming datasets:
>  # If the user specified shuffle parallelism explicitly, we'd use it to override 
> the original one
>  # If the user did NOT specify shuffle parallelism, we'd use the default value of 200
> The second case is problematic: we're blindly overriding the "natural" parallelism 
> of the data (determined by the source of the data) and replacing it with a 
> static, unrelated value.
> Instead, we should only override the parallelism in the following case (see the 
> sketch below):
>  # The user provided an overriding value explicitly
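
In code, the intended behavior amounts to the following hedged sketch 
(illustrative names, Spark Java API assumed):

{code:java}
import org.apache.spark.api.java.JavaRDD;

final class ParallelismSketch {
  // Repartitions only on an explicit user override; a null/non-positive value
  // means "not configured", so the incoming RDD keeps its natural parallelism
  // derived from its source.
  static <T> JavaRDD<T> maybeRepartition(JavaRDD<T> input, Integer userParallelism) {
    if (userParallelism != null && userParallelism > 0) {
      return input.repartition(userParallelism);
    }
    return input;
  }
}
{code}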



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5619) Fix HoodieTableFileSystemView inefficient latest base-file lookups

2023-01-25 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5619:
-

 Summary: Fix HoodieTableFileSystemView inefficient latest 
base-file lookups
 Key: HUDI-5619
 URL: https://issues.apache.org/jira/browse/HUDI-5619
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.1


Currently, when looking up the latest base-file in a single file-group, 
HoodieTableFileSystemView [has to process the whole 
partition|https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L584],
 which is obviously not very efficient.

Instead, we should be able to look up and process just the file-group in 
question (see the sketch below).
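
A hedged sketch of the intended lookup shape (illustrative names, not the actual 
AbstractTableFileSystemView code): index each partition's file-groups by fileId 
so the latest base-file lookup touches only the requested group:

{code:java}
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;

// Illustrative only: a per-partition index from fileId to that file-group's
// base files keyed by commit time, so "latest base-file" is a map hit plus a
// lastEntry() call instead of a scan over every file-group in the partition.
final class FileGroupIndexSketch {
  // fileId -> (commitTime -> base file path), commit times sorted ascending
  private final Map<String, NavigableMap<String, String>> byFileId;

  FileGroupIndexSketch(Map<String, NavigableMap<String, String>> byFileId) {
    this.byFileId = byFileId;
  }

  Optional<String> latestBaseFile(String fileId) {
    NavigableMap<String, String> group = byFileId.get(fileId);
    return (group == null || group.isEmpty())
        ? Optional.empty()
        : Optional.of(group.lastEntry().getValue());
  }
}
{code}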



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5271) Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5271:
--
Priority: Critical  (was: Major)

> Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
> 
>
> Key: HUDI-5271
> URL: https://issues.apache.org/jira/browse/HUDI-5271
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Exception detail in https://github.com/apache/hudi/issues/7284



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5271) Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5271:
--
Fix Version/s: 0.13.1

> Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
> 
>
> Key: HUDI-5271
> URL: https://issues.apache.org/jira/browse/HUDI-5271
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Exception detail in https://github.com/apache/hudi/issues/7284



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Fix Version/s: 0.13.1
   (was: 0.13.0)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This has proven wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373
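
The reuse being asked for amounts to the following hedged sketch (illustrative, 
not Hudi's actual classes): open the expensive readers lazily once and hand the 
same instance back on subsequent lookups:

{code:java}
import java.util.function.Supplier;

// Illustrative holder: the factory (e.g. "open LogScanner + HFileReader") runs
// at most once; later calls reuse the already-opened reader.
final class ReusableReader<R> {
  private final Supplier<R> factory;
  private volatile R reader;

  ReusableReader(Supplier<R> factory) {
    this.factory = factory;
  }

  R get() {
    R local = reader;
    if (local == null) {
      synchronized (this) {
        if (reader == null) {
          reader = factory.get();
        }
        local = reader;
      }
    }
    return local;
  }
}
{code}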



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5552) Too slow while using trino-hudi connector while querying partitioned tables.

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5552:
-

Assignee: Sagar Sumit  (was: Ethan Guo)

> Too slow while using trino-hudi connector while querying partitioned tables.
> 
>
> Key: HUDI-5552
> URL: https://issues.apache.org/jira/browse/HUDI-5552
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: trino-presto
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.0
>
>
> See the issue for details: [[SUPPORT] Too slow while using trino-hudi 
> connector while querying partitioned tables. · Issue #7643 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/7643]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5443) Fix exception when querying MOR table after applying NestedSchemaPruning optimization

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5443.
-
Resolution: Fixed

> Fix exception when querying MOR table after applying NestedSchemaPruning 
> optimization
> -
>
> Key: HUDI-5443
> URL: https://issues.apache.org/jira/browse/HUDI-5443
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> This has been discovered while working on HUDI-5384.
> After NestedSchemaPruning has been applied successfully, reading from a MOR 
> table could encounter the following exception when the actual delta-log file 
> merging is performed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5520:
--
Fix Version/s: 0.13.1
   (was: 0.13.0)

> Fail MDT when list of log files grows unboundedly
> -
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5392) Fix Bootstrap files reader to configure arrays to be read in the new format

2023-01-24 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-5392.
-
Resolution: Fixed

> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---
>
> Key: HUDI-5392
> URL: https://issues.apache.org/jira/browse/HUDI-5392
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When writing the Bootstrap file we're using the Spark writer, which writes arrays 
> in the new format, while Hudi reads them in the old (Avro-compatible) format:
> {code:java}
>  // Old
>  optional group tip_history (LIST) {
> repeated group array {
>   optional double amount;
>   optional binary currency (UTF8);
> }
>   }
>  // new
>  optional group tip_history (LIST) {
> repeated group list {
>   optional group element {
> optional double amount;
> optional binary currency (UTF8);
>   }
> }
>   } {code}
>  
> To fix that we need to make sure that Bootstrap files are *always* read in the 
> new format (the Spark default), unlike Hudi's Parquet files (see the sketch below).
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't 
> actually assert that the records are written correctly.
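
One possible knob for the read side is sketched below; the property name comes 
from parquet-avro's AvroSchemaConverter, and its exact semantics here are an 
assumption rather than the fix that was merged:

{code:java}
import org.apache.hadoop.conf.Configuration;

final class BootstrapReadConfSketch {
  // Sketch only: hint parquet-avro that the repeated group inside a LIST is
  // the synthetic "list" wrapper (modern 3-level layout), not the element
  // record itself (legacy 2-level layout), when reading bootstrap base files.
  static Configuration newFormatReadConf() {
    Configuration conf = new Configuration();
    conf.setBoolean("parquet.avro.add-list-element-records", false);
    return conf;
  }
}
{code}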



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5608) Support decimals w/ precision > 30 in Column Stats

2023-01-24 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5608:
-

 Summary: Support decimals w/ precision > 30 in Column Stats
 Key: HUDI-5608
 URL: https://issues.apache.org/jira/browse/HUDI-5608
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Affects Versions: 0.12.2
Reporter: Alexey Kudinkin


As reported in: [https://github.com/apache/hudi/issues/7732]

 

Currently we've capped the precision of supported decimals at 30, assuming that 
this number is reasonably high to cover 99% of use-cases, but it seems like 
there's still demand for even larger Decimals.

The challenge, however, is to balance the need to support longer Decimals against 
the storage space we have to provision for each of them (see the sketch below).
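
To make the storage trade-off concrete, the standard sizing arithmetic for 
fixed-length binary decimals is sketched below (plain math, not a Hudi API): the 
unscaled value of a precision-p decimal needs about p*log2(10) bits plus a sign bit:

{code:java}
final class DecimalSizing {
  // Minimum bytes needed to hold the unscaled value of a decimal with the
  // given precision in a fixed-length binary field (Parquet-style sizing).
  static int minBytesForPrecision(int precision) {
    int bits = (int) Math.ceil(precision * (Math.log(10) / Math.log(2))) + 1; // digits -> bits, plus sign bit
    return (bits + 7) / 8; // precision 30 -> 13 bytes; 38 -> 16; 50 -> 21
  }
}
{code}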

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-5392) Fix Bootstrap files reader to configure arrays to be read in the new format

2023-01-24 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647735#comment-17647735
 ] 

Alexey Kudinkin edited comment on HUDI-5392 at 1/24/23 8:13 AM:


Another contributing issue is that when reading the Bootstrap file we don't specify 
the expected schema, and therefore records from the Bootstrap file are read in 
the schema decoded from the Parquet file. This is problematic b/c when we validate 
the Avro schemas, their corresponding names are checked, and this creates 
mismatches since Parquet schemas don't bear names/namespaces (of the structs)


was (Author: alexey.kudinkin):
Another contributing issue is that when reading Bootstrap file we don't specify 
the expected schema and therefore records from the Bootstrap file are read in 
the schema decode from file's Parquet one. This is problematic b/c when we 
validate the Avro schemas their corresponding names are checked and this 
creates mismatches since Parquet schemas don't bear names/namespaces (of the 
structs)

> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---
>
> Key: HUDI-5392
> URL: https://issues.apache.org/jira/browse/HUDI-5392
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When writing the Bootstrap file we're using the Spark writer, which writes arrays 
> in the new format, while Hudi reads them in the old (Avro-compatible) format:
> {code:java}
>  // Old
>  optional group tip_history (LIST) {
> repeated group array {
>   optional double amount;
>   optional binary currency (UTF8);
> }
>   }
>  // new
>  optional group tip_history (LIST) {
> repeated group list {
>   optional group element {
> optional double amount;
> optional binary currency (UTF8);
>   }
> }
>   } {code}
>  
> To fix that we need to make sure that Bootstrap files are *always* read in the 
> new format (the Spark default), unlike Hudi's Parquet files.
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't 
> actually assert that the records are written correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5605) Tests spend > 50% of time serializing Hadoop's Configuration

2023-01-23 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5605:
-

 Summary: Tests spend > 50% of time serializing Hadoop's 
Configuration
 Key: HUDI-5605
 URL: https://issues.apache.org/jira/browse/HUDI-5605
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
 Attachments: Screenshot 2023-01-23 at 8.46.52 PM.png

Currently, some of our tests spend > 50% of their time serializing Hadoop's 
Configuration; we should analyze why and investigate how we can bring this 
number down (see the sketch below).

 

!Screenshot 2023-01-23 at 8.46.52 PM.png!
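
For reference, the usual mitigation is to serialize the Configuration once (e.g. 
into a broadcast variable) instead of capturing it in every task closure; the 
wrapper below is a hedged sketch mirroring the SerializableConfiguration helpers 
that Hudi and Spark ship, not their exact API:

{code:java}
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;

// Hedged sketch of a serializable wrapper: the Configuration is written via
// Hadoop's Writable contract when the wrapper is serialized, so broadcasting
// one wrapper per job replaces re-serializing the conf in every closure.
final class SerializableConf implements Serializable {
  private transient Configuration conf;

  SerializableConf(Configuration conf) {
    this.conf = conf;
  }

  Configuration get() {
    return conf;
  }

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    conf.write(out);
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    conf = new Configuration(false);
    conf.readFields(in);
  }
}
{code}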



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5392) Fix Bootstrap files reader to configure arrays to be read in the new format

2023-01-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5392:
--
Status: Patch Available  (was: In Progress)

> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---
>
> Key: HUDI-5392
> URL: https://issues.apache.org/jira/browse/HUDI-5392
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When writing Bootstrap file we’re using Spark writer that writes arrays in 
> the new format, while Hudi reads it in the old (Avro compatible) format:
> {code:java}
>  // Old
>  optional group tip_history (LIST) {
> repeated group array {
>   optional double amount;
>   optional binary currency (UTF8);
> }
>   }
>  // new
>  optional group tip_history (LIST) {
> repeated group list {
>   optional group element {
> optional double amount;
> optional binary currency (UTF8);
>   }
> }
>   } {code}
>  
> To fix that we need to make sure that Bootstrap files are *always* read in a 
> new format (Spark default) unlike Hudi's Parquet files
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't 
> actually assert that the records are written correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

