[jira] [Commented] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default

2024-07-11 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864995#comment-17864995
 ] 

Geser Dugarov commented on HUDI-7938:
-

Raised an issue to discuss the expected behavior in the 1.0-rc2 release:

https://github.com/apache/hudi/issues/11616

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
> ---
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
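
A fuller reproduction sketch (editor's addition, not from the original
report): writing a small Hudi table and reading it back in a PySpark session
that does not set spark.kryo.registrator. The table name, path, and write
options are placeholders, and a real run may need further options (e.g. a
precombine field):

{quote}from pyspark.sql import SparkSession

# Session deliberately left without spark.kryo.registrator, matching the
# failing setup described above.
spark = SparkSession.builder.appName("hudi-npe-repro").getOrCreate()

tmp_dir_path = "/tmp/hudi_npe_repro"
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
(df.write.format("org.apache.hudi")
   .option("hoodie.table.name", "npe_repro")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .mode("overwrite")
   .save(tmp_dir_path))

# The read below is the call from the report that hits the NPE on executors.
df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
{quote}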



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator

2024-07-11 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864971#comment-17864971
 ] 

Geser Dugarov commented on HUDI-7709:
-

[~codope] I prepared another fix that avoids using nulls. Could you please 
take a look at the corresponding PR?

[https://github.com/apache/hudi/pull/11615]

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator

2024-07-11 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864959#comment-17864959
 ] 

Geser Dugarov edited comment on HUDI-7709 at 7/11/24 8:16 AM:
--

[~codope] If you don't mind, could you please describe how to 
reproduce the NPE? I couldn't find a suitable test scenario.


was (Author: JIRAUSER301110):
[~codope] If you don't mind, could you please describe how to 
reproduce the NPE?

 

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator

2024-07-11 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864959#comment-17864959
 ] 

Geser Dugarov commented on HUDI-7709:
-

[~codope] If you don't mind, could you please describe how to 
reproduce the NPE?

 

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator

2024-07-10 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7709:

Summary: ClassCastException while reading the data using 
TimestampBasedKeyGenerator  (was: Class Cast Exception while reading the data 
using TimestampBasedKeyGenerator)

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7938) Missed HoodieSparkKryoRegistrar in broadcasted storage config

2024-07-05 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862726#comment-17862726
 ] 

Geser Dugarov edited comment on HUDI-7938 at 7/5/24 8:33 AM:
-

The setting

spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar

is missing from the Hadoop configuration.


was (Author: JIRAUSER301110):
The setting

spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar

is missing from the configuration.
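
A minimal sketch (editor's addition, not from the original comment) of
setting the registrator explicitly when building the PySpark session; the
application name is a placeholder:

{quote}from pyspark.sql import SparkSession

# Register Hudi's Kryo registrar up front, which the comment above reports
# as missing from the configuration by default.
spark = (
    SparkSession.builder
    .appName("hudi-app")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
    .getOrCreate()
)
{quote}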

> Missed HoodieSparkKryoRegistrar in broadcasted storage config
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default

2024-07-05 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Summary: Missed HoodieSparkKryoRegistrar in Hadoop config by default  (was: 
Missed HoodieSparkKryoRegistrar in broadcasted storage config)

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
> ---
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in broadcasted storage config

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Summary: Missed HoodieSparkKryoRegistrar in broadcasted storage config  
(was: NullPointerException during read from PySpark)

> Missed HoodieSparkKryoRegistrar in broadcasted storage config
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used in partition column

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Summary: Incorrect partition pruning when TimestampBasedKeyGenerator is 
used in partition column  (was: Incorrect partition pruning when 
TimestampBasedKeyGenerator is used)

> Incorrect partition pruning when TimestampBasedKeyGenerator is used in 
> partition column
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  
> But there is another problem with range filters on a partition 
> column.
> For instance, we have a UNIX_TIMESTAMP in column ts.
> And the table is also partitioned by ts with
> hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"
> For execution of a query like:
> SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...
> it's not possible to filter rows properly.
>  
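
A worked example (editor's addition, not from the original description) of
why the numeric range filter cannot prune partitions: TimestampBasedKeyGenerator
writes the formatted string, not the epoch number, into the partition path.
The epoch values come from the query above; "yyyy-MM-dd HH" is the assumed
output format:

{quote}from datetime import datetime, timezone

fmt = "%Y-%m-%d %H"  # Python spelling of "yyyy-MM-dd HH"
for ts in (1078016000, 1718953003):
    # The partition value on storage is a string like "2024-06-21 06",
    # so BETWEEN on raw epoch values cannot be matched against these paths.
    print(ts, "->", datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt))
{quote}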



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is another problem with range filters on a partition 
column.

For instance, we have a UNIX_TIMESTAMP in column ts.

And the table is also partitioned by ts with

hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"

For execution of a query like:

SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...

it's not possible to filter rows properly.

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is another problem with partition range filters.

For instance, for a UNIX_TIMESTAMP in column ts, we set:

SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...

And the table is also partitioned by ts with

hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"

 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  
> But there is another problem with range filters on a partition 
> column.
> For instance, we have a UNIX_TIMESTAMP in column ts.
> And the table is also partitioned by ts with
> hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"
> For execution of a query like:
> SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...
> it's not possible to filter rows properly.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is another problem with partition range filters.

For instance, for a UNIX_TIMESTAMP in column ts, we set:

SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...

And the table is also partitioned by ts with

hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is also a problem with partition range filters.

For instance, for UNIX_TIMESTAMP we set:

 

 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  
> But there is another problem with partition range 
> filters.
> For instance, for a UNIX_TIMESTAMP in column ts, we set:
> SELECT ... WHERE ts BETWEEN 1078016000 AND 1718953003 ...
> And the table is also partitioned by ts with
> hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is also a problem with partition range filters.

For instance, for UNIX_TIMESTAMP we set:

 

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is also a problem with 

 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  
> But there is also a problem with partition range 
> filters.
> For instance, for UNIX_TIMESTAMP we set:
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

But there is also a problem with 

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  
> But there is also a problem with 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7952:
---

Assignee: Geser Dugarov

> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return a Seq of 
PartitionDirectory with null values.

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return an empty Seq of 
PartitionDirectory due to

 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return a Seq of 
> PartitionDirectory with null values.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return an empty 
PartitionDirectory 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return an empty 
> PartitionDirectory 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return an empty Seq of 
PartitionDirectory due to

 

  was:
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex.listFiles() would return an empty 
PartitionDirectory 


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex.listFiles() would return an empty Seq of 
> PartitionDirectory due to
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: 
The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition 
column values, could lead to empty query results.

HoodieFileIndex

  was:The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709 misses the partition pruning 
check.


> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as 
> partition column values, could lead to empty query results.
> HoodieFileIndex



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7952:

Description: The fix for the ClassCastException in 
https://issues.apache.org/jira/browse/HUDI-7709 misses the partition pruning 
check.

> Incorrect partition pruning when TimestampBasedKeyGenerator is used
> ---
>
> Key: HUDI-7952
> URL: https://issues.apache.org/jira/browse/HUDI-7952
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> The fix for the ClassCastException in 
> https://issues.apache.org/jira/browse/HUDI-7709 misses the partition 
> pruning check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used

2024-07-04 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7952:
---

 Summary: Incorrect partition pruning when 
TimestampBasedKeyGenerator is used
 Key: HUDI-7952
 URL: https://issues.apache.org/jira/browse/HUDI-7952
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862922#comment-17862922
 ] 

Geser Dugarov commented on HUDI-7938:
-

[~yihua], if you don't mind, could you please clarify what to do about 
registering the Hudi serializer in Spark?

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862921#comment-17862921
 ] 

Geser Dugarov commented on HUDI-7938:
-

To support running from PySpark without setting spark.kryo.registrator, this PR 
was landed:

[https://github.com/apache/hudi/pull/11355]

But after

[https://github.com/apache/hudi/pull/10957]

landed, we need to set it again.

For now, I don't know whether we should make this configuration mandatory or 
change the code. Leaving this task as it is for some time.

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Status: Open  (was: In Progress)

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862726#comment-17862726
 ] 

Geser Dugarov commented on HUDI-7938:
-

The setting

spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar

is missing from the configuration.

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Status: In Progress  (was: Open)

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957) landed,
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-02 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862443#comment-17862443
 ] 

Geser Dugarov commented on HUDI-7938:
-

Also reproduced with Spark 3.5.1.
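
For anyone trying to reproduce this, a minimal PySpark sketch (the write options, table name, and path are illustrative assumptions; only the read and collect() calls come from this report):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-npe-repro")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

tmp_dir_path = "/tmp/hudi_npe_repro"  # hypothetical local path

# A trivial table is enough to hit the read path.
df = spark.createDataFrame([(1, "a1", 1000)], ["id", "name", "ts"])
(
    df.write.format("org.apache.hudi")
    .option("hoodie.table.name", "npe_repro")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("overwrite")
    .save(tmp_dir_path)
)

# The read plan is built lazily; collect() triggers the executor-side NPE.
df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
{code}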

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)
> broke integration with PySpark.
> When trying to call
> {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
> the following error occurred:
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-06-28 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Description: 
HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)

broke integration with PySpark.

When trying to call
{quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
{quote}
the following error occurred:
 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{quote}
Spark 3.4.3 was used.

  was:
HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)

broke integration with PySpark.

When trying to call
{quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
{quote}
the following error occurred:
 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 

[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-06-28 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Description: 
HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)

broke integration with PySpark.

When trying to call
{quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
{quote}
the following error occurred:
 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{quote}
 

  was:
HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)

broke integration with PySpark.

Got:

 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 

[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Description: 
HUDI-7567 "Add schema evolution to the filegroup reader" (#10957)

broke integration with PySpark.

Got:

 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{quote}
 

  was:
HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke 
integration with PySpark.

Got:

 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 

[jira] [Assigned] (HUDI-7938) NullPointerException during read from PySpark

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7938:
---

Assignee: Geser Dugarov

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke 
> integration with PySpark.
> Got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Description: 
[HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke 
integration with PySpark.

Got:

```

24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) 
(10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

```

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Priority: Major
>
> [HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke 
> integration with PySpark.
> Got:
> ```
> 24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) 
> (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> 

[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Description: 
HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke 
integration with PySpark.

Got:

 
{quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 
31) (10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{quote}
 

  was:
[HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke 
integration with PySpark.

Got:

```

24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) 
(10.199.141.90 executor 0): java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
    at 
org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
    at 
org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
    at 
org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 

[jira] [Created] (HUDI-7938) NullPointerException during read from PySpark

2024-06-27 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7938:
---

 Summary: NullPointerException during read from PySpark
 Key: HUDI-7938
 URL: https://issues.apache.org/jira/browse/HUDI-7938
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-6438.
---
Resolution: Fixed

Fixed in HUDI-6219

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-6438:
---

Assignee: Geser Dugarov

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-6219.
---
Resolution: Fixed

> Ensure consistency between Spark catalog schema and Hudi schema
> ---
>
> Key: HUDI-6219
> URL: https://issues.apache.org/jira/browse/HUDI-6219
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> [HUDI-4149|https://github.com/apache/hudi/pull/5672] fixes the drop-table error 
> when the table directory has been moved, but it makes the Spark catalog table 
> schema inconsistent with the Hudi schema when some column types are not Avro 
> data types.
> *Root cause:*
> The Hudi schema uses Avro types, but the Spark catalog table schema does not. 
> There are two steps that record the schema when creating a Hudi table:
> Step 1: record the Avro-compatible schema to .hoodie/hoodie.properties,
> Step 2: record the table in the Spark catalog
> Step 2 uses HoodieCatalog.tableSchema, which is currently table.schema, and 
> that causes this issue.
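
A hedged way to see the mismatch on a concrete table (the path, table name, and the hoodie.properties key below are assumptions for illustration, not from this ticket):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-consistency-check").getOrCreate()

base_path = "/tmp/hudi_tbl"   # hypothetical file:// table location
table_name = "hudi_tbl"       # hypothetical catalog table name

# Step 2's view: the schema recorded in the Spark catalog.
print(spark.table(table_name).schema.simpleString())

# Step 1's view: the Avro-compatible schema recorded by Hudi.
with open(base_path + "/.hoodie/hoodie.properties") as f:
    for line in f:
        if line.startswith("hoodie.table.create.schema"):
            print(line.strip())  # compare field types against the catalog schema
{code}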



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-6438:

Fix Version/s: 0.14.0
   (was: 1.1.0)

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-6219:

Fix Version/s: 0.14.0

> Ensure consistency between Spark catalog schema and Hudi schema
> ---
>
> Key: HUDI-6219
> URL: https://issues.apache.org/jira/browse/HUDI-6219
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> [HUDI-4149|https://github.com/apache/hudi/pull/5672] fix the drop table error 
> if table directory moved, but it will make the Spark catalog table schema not 
> consistent with Hudi schema if some column types are not Avro data types.
> *Root cause:*
> Hudi schema is Avro types, but Spark catalog table schema is not. There are 
> two steps to record schema when create a hudi table:
> Step1: record the Avro compatible schema to .hoodie/hoodie.properties, 
> Step2: record table in Spark catalog
> The Step2 will use HoodieCatalog.tableSchema, which is table.schema now and 
> cause this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7493) Clean configuration for clean service

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7493:
---

Assignee: Geser Dugarov  (was: Lin Liu)

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning, configs, table-service
>Reporter: Lin Liu
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7487) Investigate flaky test in MERGE INTO

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7487:
---

Assignee: Geser Dugarov

> Investigate flaky test in MERGE INTO
> 
>
> Key: HUDI-7487
> URL: https://issues.apache.org/jira/browse/HUDI-7487
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> No production code changes, but this test started to fail:
> {code:java}
> - Test MERGE INTO with inserts only on MOR table when partial updates are 
> enabled *** FAILED ***
>   Expected Array([1,a1,10.0,1000,a1: desc1], [2,a2,20.0,1200,a2: desc2], 
> [3,a3,30.0,1250,a3: desc3], [4,a4,60.0,1270,a4: desc4]), but got 
> Array([1,a1,10.0,1000,a1: desc1], [2,a2,20.0,1200,a2: desc2], 
> [3,a3,30.0,1250,a3: desc3]) (HoodieSparkSqlTestBase.scala:109)
> 1564068 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN  
> org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data 
> file written for commit, so could not get schema for table 
> file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h9
> 1564072 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN  
> org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data 
> file written for commit, so could not get schema for table 
> file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h9
> 1564094 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN  
> org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data 
> file written for commit, so could not get schema for table 
> file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h10 {code}
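
For orientation, a hedged sketch of the shape of statement the test name describes (the target table and columns are inferred from the expected rows above, not from the test source, and Hudi's Spark SQL extensions are assumed to be enabled):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-into-sketch").getOrCreate()

# Inserts-only MERGE INTO against a MOR table with partial updates enabled.
spark.sql("""
  MERGE INTO h9 AS t
  USING (
    SELECT 4 AS id, 'a4' AS name, 60.0 AS price, 1270 AS ts, 'desc4' AS description
  ) AS s
  ON t.id = s.id
  WHEN NOT MATCHED THEN INSERT *
""")
{code}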



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema

2024-06-27 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860402#comment-17860402
 ] 

Geser Dugarov edited comment on HUDI-6947 at 6/27/24 9:51 AM:
--

Fixed in the master branch, 2e39bfb694099293b77eec9977e5e46af97af18b


was (Author: JIRAUSER301110):
Fixed in the master branch, cddd7d416a5db31de879790a80a33bb86cf02cbc

> Clean up HoodieSparkSqlWriter.deduceWriterSchema
> 
>
> Key: HUDI-6947
> URL: https://issues.apache.org/jira/browse/HUDI-6947
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs, spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> too many flags here:
> ADD_NULL_FOR_DELETED_COLUMNS
> RECONCILE_SCHEMA
> AVRO_SCHEMA_VALIDATE_ENABLE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema

2024-06-27 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860402#comment-17860402
 ] 

Geser Dugarov commented on HUDI-6947:
-

Fixed in the master branch, cddd7d416a5db31de879790a80a33bb86cf02cbc

> Clean up HoodieSparkSqlWriter.deduceWriterSchema
> 
>
> Key: HUDI-6947
> URL: https://issues.apache.org/jira/browse/HUDI-6947
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs, spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> too many flags here:
> ADD_NULL_FOR_DELETED_COLUMNS
> RECONCILE_SCHEMA
> AVRO_SCHEMA_VALIDATE_ENABLE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-6947:
---

Assignee: Geser Dugarov

> Clean up HoodieSparkSqlWriter.deduceWriterSchema
> 
>
> Key: HUDI-6947
> URL: https://issues.apache.org/jira/browse/HUDI-6947
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs, spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> too many flags here:
> ADD_NULL_FOR_DELETED_COLUMNS
> RECONCILE_SCHEMA
> AVRO_SCHEMA_VALIDATE_ENABLE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7275) org.apache.hudi.TestHoodieSparkSqlWriter#testInsertDatasetWithTimelineTimezoneUTC causes issues with following tests

2024-06-27 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7275:
---

Assignee: Geser Dugarov

> org.apache.hudi.TestHoodieSparkSqlWriter#testInsertDatasetWithTimelineTimezoneUTC
>  causes issues with following tests
> 
>
> Key: HUDI-7275
> URL: https://issues.apache.org/jira/browse/HUDI-7275
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When the next test runs, it gets stuck in an infinite loop and the output is 
> {code:java}
> 60331 [main] INFO  org.apache.hudi.common.table.timeline.TimeGeneratorBase [] 
> - Released the connection of the timeGenerator lock
> 60331 [main] INFO  org.apache.hudi.common.table.timeline.TimeGeneratorBase [] 
> - LockProvider for TimeGenerator: 
> org.apache.hudi.client.transaction.lock.InProcessLockProvider
> 60331 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, 
> Read locks = 0], Thread main, In-process lock state ACQUIRING
> 60331 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, 
> Read locks = 0], Thread main, In-process lock state ACQUIRED
> 60333 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, 
> Read locks = 0], Thread main, In-process lock state RELEASING
> 60333 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, 
> Read locks = 0], Thread main, In-process lock state RELEASED
> 60333 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, 
> Read locks = 0], Thread main, In-process lock state ALREADY_RELEASED
> 60333 [main] INFO  org.apache.hudi.common.table.timeline.TimeGeneratorBase [] 
> - Released the connection of the timeGenerator lock
> 60333 [main] INFO  org.apache.hudi.common.table.timeline.TimeGeneratorBase [] 
> - LockProvider for TimeGenerator: 
> org.apache.hudi.client.transaction.lock.InProcessLockProvider
> 60333 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, 
> Read locks = 0], Thread main, In-process lock state ACQUIRING
> 60333 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, 
> Read locks = 0], Thread main, In-process lock state ACQUIRED
> 60334 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, 
> Read locks = 0], Thread main, In-process lock state RELEASING
> 60334 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, 
> Read locks = 0], Thread main, In-process lock state RELEASED
> 60334 [main] INFO  
> org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path 
> /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1,
>  Lock Instance 
> 

[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service

2024-06-26 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7646:

Status: Open  (was: In Progress)

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
> Fix For: 1.0.0
>
>
> The set of configuration parameters for the compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Naming needs to be refactored while preserving backward compatibility.
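
To make the naming inconsistency concrete, a small sketch with illustrative values (only the keys come from the lists above; the values and trigger strategy are assumptions):

{code:python}
# Illustrative values only; note the differing prefixes and separators.
spark_compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.compaction.target.io": "512000",
    "hoodie.log.compaction.enable": "false",
}
flink_compaction_opts = {
    "compaction.schedule.enabled": "true",
    "compaction.delta_commits": "5",      # underscore separator here...
    "compaction.target_io": "512000",     # ...and here, vs. dots on the Spark side
    "compaction.trigger.strategy": "num_commits",
}
{code}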



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7850) Makes hoodie.record.merge.mode mandatory upon creating the table and first write

2024-06-25 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7850:

Status: In Progress  (was: Open)

> Makes hoodie.record.merge.mode mandatory upon creating the table and first 
> write
> 
>
> Key: HUDI-7850
> URL: https://issues.apache.org/jira/browse/HUDI-7850
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
> Fix For: 1.0.0
>
>
> Right now, "hoodie.record.merge.mode" is optional during writes as it is 
> inferred from the payload class name, payload type, and the record merger 
> strategy during the creation of the table properties.  We should make this 
> config mandatory in release 1.0 and make other merge configs optional to 
> simplify the configuration experience.
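
A sketch of what making the config mandatory could look like from the user side, assuming it is set on the table-creating write (the surrounding write options and the example merge-mode value are assumptions, not from this ticket):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-mode-on-create").getOrCreate()

df = spark.createDataFrame([(1, "a1", 1000)], ["id", "name", "ts"])
(
    df.write.format("hudi")
    .option("hoodie.table.name", "tbl")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    # Pinned explicitly on the first write; EVENT_TIME_ORDERING is an
    # assumed example value, not taken from this ticket.
    .option("hoodie.record.merge.mode", "EVENT_TIME_ORDERING")
    .mode("overwrite")
    .save("/tmp/tbl")
)
{code}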



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`

2024-06-25 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7925:

Description: 
There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
"hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
could get a ClassCastException while extracting partition values; for instance, 
see HUDI-7709.
Logic similar to `HoodieBaseRelation` needs to be implemented.

  was:
There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
"hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
could get a ClassCastException while extracting partition values; for instance, 
see .
Logic similar to `HoodieBaseRelation` needs to be implemented.


> Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`
> --
>
> Key: HUDI-7925
> URL: https://issues.apache.org/jira/browse/HUDI-7925
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Major
>
> There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
> "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
> could get a ClassCastException while extracting partition values; for instance, 
> see HUDI-7709.
> Logic similar to `HoodieBaseRelation` needs to be implemented.
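
A hedged read-path sketch combining the two configs involved (the first key comes from this ticket, the second from HUDI-7033; the table path is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fg-reader-partition-values").getOrCreate()

# Reading a partitioned table with the new file group reader (the default)
# while asking Spark to take partition values from the path.
df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.file.group.reader.enabled", "true")
    .option("hoodie.datasource.read.extract.partition.values.from.path", "true")
    .load("/tmp/partitioned_tbl")
)
df.show()
{code}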



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`

2024-06-25 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7925:

Description: 
There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
"hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
could get a ClassCastException while extracting partition values; for instance, 
see .
Logic similar to `HoodieBaseRelation` needs to be implemented.

  was:
There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
"hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
get null values.
Logic similar to `HoodieBaseRelation` needs to be implemented.


> Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`
> --
>
> Key: HUDI-7925
> URL: https://issues.apache.org/jira/browse/HUDI-7925
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Major
>
> There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
> "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
> could get a ClassCastException while extracting partition values; for instance, 
> see .
> Logic similar to `HoodieBaseRelation` needs to be implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`

2024-06-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7925:

Summary: Implement logic for 
`shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`  (was: Do not extract values from partition 
paths in `HoodieHadoopFsRelationFactory`)

> Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`
> --
>
> Key: HUDI-7925
> URL: https://issues.apache.org/jira/browse/HUDI-7925
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Major
>
> There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
> "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
> get null values.
> Logic similar to `HoodieBaseRelation` needs to be implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`

2024-06-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7925:

Description: 
There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
`HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
"hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
get null values.
Logic similar to `HoodieBaseRelation` needs to be implemented.

  was:`shouldExtractPartitionValuesFromPartitionPath` is not used in 


> Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`
> -
>
> Key: HUDI-7925
> URL: https://issues.apache.org/jira/browse/HUDI-7925
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Major
>
> There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in 
> `HoodieHadoopFsRelationFactory`. Therefore, when reading data with 
> "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we 
> got null values.
> Need to implement logic similar to `HoodieBaseRelation`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`

2024-06-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7925:

Description: `shouldExtractPartitionValuesFromPartitionPath` is not used in 

> Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`
> -
>
> Key: HUDI-7925
> URL: https://issues.apache.org/jira/browse/HUDI-7925
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Major
>
> `shouldExtractPartitionValuesFromPartitionPath` is not used in 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`

2024-06-24 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7925:
---

 Summary: Do not extract values from partition paths in 
`HoodieHadoopFsRelationFactory`
 Key: HUDI-7925
 URL: https://issues.apache.org/jira/browse/HUDI-7925
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7033) Fix read error for schema evolution + partition value extraction

2024-06-24 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859598#comment-17859598
 ] 

Geser Dugarov edited comment on HUDI-7033 at 6/24/24 7:50 AM:
--

Merged commit a4fa3451916de11dc082792076b62013586dadaf in linked MR 9994 
refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889]


was (Author: JIRAUSER301110):
Merged a4fa3451916de11dc082792076b62013586dadaf
refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889]

> Fix read error for schema evolution + partition value extraction
> 
>
> Key: HUDI-7033
> URL: https://issues.apache.org/jira/browse/HUDI-7033
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> After HUDI-6960 is merged, *shouldExtractPartitionValuesFromPartitionPath* 
> will correctly ignore partition columns in requiredSchema.
>  
> When using the configs below, there will be read errors.
>  
> {code:java}
> hoodie.datasource.read.extract.partition.values.from.path = true {code}
>  
>  
> When the config above is added together with:
>  
> {code:java}
> hoodie.schema.on.read.enable = true {code}
>  
> The query schema will be pruned to *NOT* contain any partition 
> columns.
>  
> When rebuilding Parquet filters, the file schema's columns are scanned against 
> the querySchema. However, Hudi files (the file schema) might still contain partition 
> columns, so when partition filters are rebuilt with this file schema 
> against the query schema, the partition columns will not be found.
>  
> {code:java}
> Caused by: java.lang.IllegalArgumentException: cannot found filter col 
> name:region from querySchema: table {
>  5: id: optional int
>  6: name: optional string
>  7: ts: optional long
> }
> at 
> org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180)
>  {code}
>  
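> 
> A hedged repro sketch combining the two configs above (the reader call and 
> the filter are illustrative; the partition column name is taken from the 
> error message):
> 
> {code:java}
> val basePath = "file:///tmp/hudi/tbl" // illustrative path
> // Filtering on a partition column with both configs enabled triggers the
> // filter-rebuild failure above.
> spark.read.format("hudi").
>   option("hoodie.datasource.read.extract.partition.values.from.path", "true").
>   option("hoodie.schema.on.read.enable", "true").
>   load(basePath).
>   where("region = 'us-east'").
>   show()
> {code}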



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-7033) Fix read error for schema evolution + partition value extraction

2024-06-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reopened HUDI-7033:
-

Merged a4fa3451916de11dc082792076b62013586dadaf
refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889]

> Fix read error for schema evolution + partition value extraction
> 
>
> Key: HUDI-7033
> URL: https://issues.apache.org/jira/browse/HUDI-7033
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> After HUDI-6960 is merged, *shouldExtractPartitionValuesFromPartitionPath* 
> will correctly ignore partition columns in requiredSchema.
>  
> When using the configs below, there will be read errors.
>  
> {code:java}
> hoodie.datasource.read.extract.partition.values.from.path = true {code}
>  
>  
> When the config above is added together with:
>  
> {code:java}
> hoodie.schema.on.read.enable = true {code}
>  
> The query schema will be pruned to *NOT* contain any partition 
> columns.
>  
> When rebuilding Parquet filters, the file schema's columns are scanned against 
> the querySchema. However, Hudi files (the file schema) might still contain partition 
> columns, so when partition filters are rebuilt with this file schema 
> against the query schema, the partition columns will not be found.
>  
> {code:java}
> Caused by: java.lang.IllegalArgumentException: cannot found filter col 
> name:region from querySchema: table {
>  5: id: optional int
>  6: name: optional string
>  7: ts: optional long
> }
> at 
> org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7033) Fix read error for schema evolution + partition value extraction

2024-06-24 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859598#comment-17859598
 ] 

Geser Dugarov edited comment on HUDI-7033 at 6/24/24 7:47 AM:
--

Merged a4fa3451916de11dc082792076b62013586dadaf
refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889]


was (Author: JIRAUSER301110):
Merged a4fa3451916de11dc082792076b62013586dadaf
refer to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889]

> Fix read error for schema evolution + partition value extraction
> 
>
> Key: HUDI-7033
> URL: https://issues.apache.org/jira/browse/HUDI-7033
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> After HUDI-6960 is merged, *shouldExtractPartitionValuesFromPartitionPath* 
> will correctly ignore partition columns in requiredSchema.
>  
> When using the configs below, there will be read errors.
>  
> {code:java}
> hoodie.datasource.read.extract.partition.values.from.path = true {code}
>  
>  
> When the config above is added together with:
>  
> {code:java}
> hoodie.schema.on.read.enable = true {code}
>  
> The query schema will be pruned to *NOT* contain any partition 
> columns.
>  
> When rebuilding Parquet filters, the file schema's columns are scanned against 
> the querySchema. However, Hudi files (the file schema) might still contain partition 
> columns, so when partition filters are rebuilt with this file schema 
> against the query schema, the partition columns will not be found.
>  
> {code:java}
> Caused by: java.lang.IllegalArgumentException: cannot found filter col 
> name:region from querySchema: table {
>  5: id: optional int
>  6: name: optional string
>  7: ts: optional long
> }
> at 
> org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (HUDI-7033) Fix read error for schema evolution + partition value extraction

2024-06-24 Thread Geser Dugarov (Jira)


[ https://issues.apache.org/jira/browse/HUDI-7033 ]


Geser Dugarov deleted comment on HUDI-7033:
-

was (Author: JIRAUSER301110):
Fixed in master, a4fa3451916de11dc082792076b62013586dadaf

> Fix read error for schema evolution + partition value extraction
> 
>
> Key: HUDI-7033
> URL: https://issues.apache.org/jira/browse/HUDI-7033
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> After HUDI-6960 is merged, *shouldExtractPartitionValuesFromPartitionPath* 
> will correctly ignore partition columns in requiredSchema.
>  
> When using the configs below, there will be read errors.
>  
> {code:java}
> hoodie.datasource.read.extract.partition.values.from.path = true {code}
>  
>  
> When the config above is added together with:
>  
> {code:java}
> hoodie.schema.on.read.enable = true {code}
>  
> The query schema will be pruned to *NOT* contain any partition 
> columns.
>  
> When rebuilding Parquet filters, the file schema's columns are scanned against 
> the querySchema. However, Hudi files (the file schema) might still contain partition 
> columns, so when partition filters are rebuilt with this file schema 
> against the query schema, the partition columns will not be found.
>  
> {code:java}
> Caused by: java.lang.IllegalArgumentException: cannot found filter col 
> name:region from querySchema: table {
>  5: id: optional int
>  6: name: optional string
>  7: ts: optional long
> }
> at 
> org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7033) Fix read error for schema evolution + partition value extraction

2024-06-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-7033.
---
Resolution: Fixed

Fixed in master, a4fa3451916de11dc082792076b62013586dadaf

> Fix read error for schema evolution + partition value extraction
> 
>
> Key: HUDI-7033
> URL: https://issues.apache.org/jira/browse/HUDI-7033
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> After HUDI-6960 is merged, *shouldExtractPartitionValuesFromPartitionPath* 
> will correctly ignore partition columns in requiredSchema.
>  
> When using the configs below, there will be read errors.
>  
> {code:java}
> hoodie.datasource.read.extract.partition.values.from.path = true {code}
>  
>  
> When the config above is added together with:
>  
> {code:java}
> hoodie.schema.on.read.enable = true {code}
>  
> The query schema will be pruned to *NOT* contain any partition 
> columns.
>  
> When rebuilding Parquet filters, the file schema's columns are scanned against 
> the querySchema. However, Hudi files (the file schema) might still contain partition 
> columns, so when partition filters are rebuilt with this file schema 
> against the query schema, the partition columns will not be found.
>  
> {code:java}
> Caused by: java.lang.IllegalArgumentException: cannot found filter col 
> name:region from querySchema: table {
>  5: id: optional int
>  6: name: optional string
>  7: ts: optional long
> }
> at 
> org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6286) Overwrite mode should not delete old data

2024-06-18 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855820#comment-17855820
 ] 

Geser Dugarov commented on HUDI-6286:
-

Note that in HoodieWriterUtils.validateTableConfig() we skip all conflict checks 
between new and existing table configurations if the save mode is Overwrite.
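
A minimal sketch of the behavior described, with simplified names (this is not 
the actual signature of HoodieWriterUtils.validateTableConfig()):

{code:java}
// Hedged sketch: every conflict check is skipped when the save mode is Overwrite.
def validateTableConfig(params: Map[String, String],
                        tableConfig: Map[String, String],
                        isOverwriteMode: Boolean): Unit = {
  if (isOverwriteMode) {
    return // record key, precombine field, key generator, etc. are not compared
  }
  // ... otherwise each incoming option is validated against the existing table config
}
{code}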

> Overwrite mode should not delete old data
> -
>
> Key: HUDI-6286
> URL: https://issues.apache.org/jira/browse/HUDI-6286
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
> Fix For: 1.1.0
>
>
> https://github.com/apache/hudi/pull/8076/files#r1127283648
> For *Overwrite* mode, we should not delete the basePath. Just overwrite the 
> existing data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7847) Infer record merge mode during table upgrade

2024-06-10 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853633#comment-17853633
 ] 

Geser Dugarov commented on HUDI-7847:
-

Thanks for mentioning. I will reuse it.

> Infer record merge mode during table upgrade
> 
>
> Key: HUDI-7847
> URL: https://issues.apache.org/jira/browse/HUDI-7847
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
> Fix For: 1.0.0
>
>
> Record merge mode is required to dictate the merging behavior in release 1.x, 
> playing the same role as the payload class config in release 0.x. During 
> table upgrade, we need to infer the record merge mode based on the payload 
> class so it's correctly set.
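> 
> A hedged sketch of the inference (the merge mode names assume the 1.0 
> RecordMergeMode values; the payload classes are the common 0.x defaults):
> 
> {code:java}
> // Illustrative mapping only; the actual upgrade logic may cover more payload classes.
> def inferRecordMergeMode(payloadClass: String): String = payloadClass match {
>   case "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" => "COMMIT_TIME_ORDERING"
>   case "org.apache.hudi.common.model.DefaultHoodieRecordPayload" => "EVENT_TIME_ORDERING"
>   case _ => "CUSTOM"
> }
> {code}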



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7847) Infer record merge mode during table upgrade

2024-06-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7847:
---

Assignee: Geser Dugarov

> Infer record merge mode during table upgrade
> 
>
> Key: HUDI-7847
> URL: https://issues.apache.org/jira/browse/HUDI-7847
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
> Fix For: 1.0.0
>
>
> Record merge mode is required to dictate the merging behavior in release 1.x, 
> playing the same role as the payload class config in release 0.x. During 
> table upgrade, we need to infer the record merge mode based on the payload 
> class so it's correctly set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7850) Makes hoodie.record.merge.mode mandatory upon creating the table and first write

2024-06-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7850:
---

Assignee: Geser Dugarov

> Makes hoodie.record.merge.mode mandatory upon creating the table and first 
> write
> 
>
> Key: HUDI-7850
> URL: https://issues.apache.org/jira/browse/HUDI-7850
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
> Fix For: 1.0.0
>
>
> Right now, "hoodie.record.merge.mode" is optional during writes as it is 
> inferred from the payload class name, payload type, and the record merger 
> strategy during the creation of the table properties.  We should make this 
> config mandatory in release 1.0 and make other merge configs optional to 
> simplify the configuration experience.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7827) Bump io.airlift:aircompressor from 0.25 to 0.27

2024-06-06 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-7827.
---
Resolution: Fixed

Fixed in master, d0c7de050a8900a29f5d127093b378b96f9c5158

> Bump io.airlift:aircompressor from 0.25 to 0.27
> ---
>
> Key: HUDI-7827
> URL: https://issues.apache.org/jira/browse/HUDI-7827
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-05-20 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-3204:

Description: 
Currently, because Spark by default omits partition values from the 
data files (instead encoding them into partition paths for partitioned tables), 
using `TimestampBasedKeyGenerator` with an original timestamp-based column makes it 
impossible to retrieve the original value (reading from Spark) even though it's 
persisted in the data file as well.

 
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).
  toDF("id", "name", "age", "ts", "data_date")

// mor
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_mor")

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
|  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
// still can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// cow
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_cow")

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
|  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show

// but 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
 

 

  was:
{color:#172b4d}Currently, b/c Spark by default omits partition values from the 
data files (instead encoding them into partition paths for partitioned tables), 
using `TimestampBasedKeyGenerator` w/ original 

[jira] [Comment Edited] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-20 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040
 ] 

Geser Dugarov edited comment on HUDI-7709 at 5/21/24 4:23 AM:
--

The issue is related to HUDI-3204.
Spark by default retrieves values for the partitioning column from partition paths. 
We couldn't do this for TimestampBasedKeyGenerator because data is lost after the 
user-defined transformations in "hoodie.keygen.timebased.output.dateformat".
Looking for a proper fix.
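
A minimal illustration of why the transformation is lossy (the formats are 
illustrative; plain JDK time API):

{code:java}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The output format may keep less information than the input format, so the
// original column value cannot be recovered from the partition path.
val input = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val output = DateTimeFormatter.ofPattern("yyyy/MM") // user-defined, drops the day
val path1 = LocalDate.parse("2018-09-23", input).format(output) // "2018/09"
val path2 = LocalDate.parse("2018-09-24", input).format(output) // "2018/09"
// path1 == path2: two distinct values map to the same partition path.
{code}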


was (Author: JIRAUSER301110):
The issue related to HUDI-3204.
Spark by default retrieves values for partitioning column from partition paths. 
We couldn't do it for TimestampBasedKeyGenerator due to lost data after user 
defined transformations in "hoodie.keygen.timebased.output.dateformat".
Looking for proper fixing.

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-20 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040
 ] 

Geser Dugarov edited comment on HUDI-7709 at 5/21/24 4:22 AM:
--

The issue is related to HUDI-3204.
Spark by default retrieves values for the partitioning column from partition paths. 
We couldn't do this for TimestampBasedKeyGenerator because data is lost after the 
user-defined transformations in "hoodie.keygen.timebased.output.dateformat".
Looking for a proper fix.


was (Author: JIRAUSER301110):
The issue related to HUDI-3204. Spark by default retrieve values for 
partitioning column from partition paths. We couldn't do it for 
TimestampBasedKeyGenerator due to lost data after user defined transformations 
in hoodie.keygen.timebased.output.dateformat.

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-20 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040
 ] 

Geser Dugarov commented on HUDI-7709:
-

The issue is related to HUDI-3204. Spark by default retrieves values for the 
partitioning column from partition paths. We couldn't do this for 
TimestampBasedKeyGenerator because data is lost after the user-defined 
transformations in hoodie.keygen.timebased.output.dateformat.

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-20 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7709:

Status: In Progress  (was: Open)

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-16 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7709:
---

Assignee: Geser Dugarov

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 0.15.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-05-16 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7709:

Fix Version/s: 1.0.0
   (was: 0.15.0)

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-16 Thread Geser Dugarov (Jira)


[ https://issues.apache.org/jira/browse/HUDI-7717 ]


Geser Dugarov deleted comment on HUDI-7717:
-

was (Author: JIRAUSER301110):
Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-16 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847023#comment-17847023
 ] 

Geser Dugarov commented on HUDI-7717:
-

Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-16 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-7717.
---
Resolution: Fixed

Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-16 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov resolved HUDI-7717.
-

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-15 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846808#comment-17846808
 ] 

Geser Dugarov commented on HUDI-7717:
-

An MR with the fix is under review.

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-15 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7717:

Fix Version/s: 1.0.0
   (was: 0.15.0)

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer

2024-05-14 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7757:

Description: There is a return statement in the middle of the huge function 
HoodieSparkSqlWriter.writeInternal().
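
A minimal illustration of the described shortcut (names assumed; this is not 
Hudi's actual signature):

{code:java}
// Hedged sketch: an early return buried in the middle of a very long method.
def writeInternal(useRowWriterShortcut: Boolean): String = {
  // ... extensive setup shared by all write paths (elided) ...
  if (useRowWriterShortcut) {
    return "bulk_insert via row writer" // the shortcut exits mid-function
  }
  // ... several hundred more lines for the general write path ...
  "general write path"
}
{code}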

> Revisit shortcut for bulk insert with enabled row writer
> 
>
> Key: HUDI-7757
> URL: https://issues.apache.org/jira/browse/HUDI-7757
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> There is a return statement in the middle of the huge function 
> HoodieSparkSqlWriter.writeInternal().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer

2024-05-14 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7757:
---

 Summary: Revisit shortcut for bulk insert with enabled row writer
 Key: HUDI-7757
 URL: https://issues.apache.org/jira/browse/HUDI-7757
 Project: Apache Hudi
  Issue Type: Task
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer

2024-05-14 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7757:
---

Assignee: Geser Dugarov

> Revisit shortcut for bulk insert with enabled row writer
> 
>
> Key: HUDI-7757
> URL: https://issues.apache.org/jira/browse/HUDI-7757
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7646:

Fix Version/s: 1.0.0

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
> Fix For: 1.0.0
>
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7646) Consistent naming in Compaction service

2024-05-09 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845082#comment-17845082
 ] 

Geser Dugarov commented on HUDI-7646:
-

Prepared a local environment for running the TPC-H benchmark. I will research 
the Compaction parameters configuration from the user's point of view.

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov closed HUDI-7737.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845076#comment-17845076
 ] 

Geser Dugarov commented on HUDI-7737:
-

Fixed via master branch: cdd146b2c73d50a28bee9f712b689df4fc923222

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>  Labels: pull-request-available
>
> Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov resolved HUDI-7737.
-

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>  Labels: pull-request-available
>
> Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7737:

Priority: Minor  (was: Major)

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>  Labels: pull-request-available
>
> Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-09 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844891#comment-17844891
 ] 

Geser Dugarov edited comment on HUDI-7717 at 5/9/24 7:24 AM:
-

Working on local PySpark environment deployment and configuration for quick 
checks.
I suppose that changing the Spark SaveMode from Overwrite to Append could lead 
to the expected behavior.
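
A hedged sketch of the failing combination from the ticket (the writer 
boilerplate is illustrative; the config keys are the ones involved):

{code:java}
// Assuming `df` contains rows with duplicate record keys: combine-before-insert
// is expected to deduplicate them, but with meta fields disabled the bulk_insert
// path silently skips deduplication, leaving duplicates in the table.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.combine.before.insert", "true").
  option("hoodie.populate.meta.fields", "false").
  option("hoodie.datasource.write.recordkey.field", "id").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_11044") // illustrative path
{code}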


was (Author: JIRAUSER301110):
Working on local PySpark environment setting for quick checking.
I suppose that change of Spark SaveMode from Overwrite to Append could lead to 
expected behavior.

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 0.15.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7737:

Description: Spark 3.4.3 has been released: 
https://github.com/apache/spark/tree/v3.4.3

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7737:
---

 Summary: Bump Spark 3.4 version to Spark 3.4.3
 Key: HUDI-7737
 URL: https://issues.apache.org/jira/browse/HUDI-7737
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3

2024-05-09 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7737:
---

Assignee: Geser Dugarov

> Bump Spark 3.4 version to Spark 3.4.3
> -
>
> Key: HUDI-7737
> URL: https://issues.apache.org/jira/browse/HUDI-7737
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)

2024-05-08 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7717:
---

Assignee: Geser Dugarov

> hoodie.combine.before.insert silently broken for bulk_insert if meta fields 
> disabled (causes duplicates)
> 
>
> Key: HUDI-7717
> URL: https://issues.apache.org/jira/browse/HUDI-7717
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
> Fix For: 0.15.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/11044]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7646) Consistent naming in Compaction service

2024-04-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov reassigned HUDI-7646:
---

Assignee: Geser Dugarov

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service

2024-04-24 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7646:

Status: In Progress  (was: Open)

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7646) Consistent naming in Compaction service

2024-04-22 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839541#comment-17839541
 ] 

Geser Dugarov commented on HUDI-7646:
-

The main question is using ".inline" vs ".async". The current distribution is 
the following.

Using ".inline":
* hoodie.compact.inline
* hoodie.compact.schedule.inline
* hoodie.log.compaction.inline
* hoodie.clustering.inline
* hoodie.clustering.schedule.inline
* hoodie.partition.ttl.inline

Using ".async":
* hoodie.clean.async.enabled
* clean.async.enabled
* compaction.async.enabled
* hoodie.kafka.compaction.async.enable
* hoodie.clustering.async.enabled
* clustering.async.enabled
* hoodie.archive.async
* hoodie.embed.timeline.server.async
* hoodie.metadata.index.async
* hoodie.datasource.compaction.async.enable

Looks like it's preferable to move toward the ".async" option.

And from the user's point of view, it's more obvious what ".async" means compared 
with ".inline", which requires the user to understand the Hudi write process.

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7646) Consistent naming in Compaction service

2024-04-22 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839541#comment-17839541
 ] 

Geser Dugarov edited comment on HUDI-7646 at 4/22/24 8:11 AM:
--

The main question is which options are preferable: ".inline" or ".async" 
naming. The current distribution is the following.

Using ".inline":
* hoodie.compact.inline
* hoodie.compact.schedule.inline
* hoodie.log.compaction.inline
* hoodie.clustering.inline
* hoodie.clustering.schedule.inline
* hoodie.partition.ttl.inline

Using ".async":
* hoodie.clean.async.enabled
* clean.async.enabled
* compaction.async.enabled
* hoodie.kafka.compaction.async.enable
* hoodie.clustering.async.enabled
* clustering.async.enabled
* hoodie.archive.async
* hoodie.embed.timeline.server.async
* hoodie.metadata.index.async
* hoodie.datasource.compaction.async.enable

Looks like it's preferable to move toward the ".async" option.

And from the user's point of view, it's more obvious what ".async" means compared 
with ".inline", which requires the user to understand the Hudi write process.


was (Author: JIRAUSER301110):
The main question is using ".inline" vs ".async". The current distribution is 
the following.

Using ".inline":
* hoodie.compact.inline
* hoodie.compact.schedule.inline
* hoodie.log.compaction.inline
* hoodie.clustering.inline
* hoodie.clustering.schedule.inline
* hoodie.partition.ttl.inline

Using ".async":
* hoodie.clean.async.enabled
* clean.async.enabled
* compaction.async.enabled
* hoodie.kafka.compaction.async.enable
* hoodie.clustering.async.enabled
* clustering.async.enabled
* hoodie.archive.async
* hoodie.embed.timeline.server.async
* hoodie.metadata.index.async
* hoodie.datasource.compaction.async.enable

Looks like it's preferable to move toward ".async" option.

And from user point of view, it's more obvious what ".async" means in comparing 
with ".inline", which needs to clarify the Hudi write process for a user.

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service

2024-04-22 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7646:

Description: 
The set of configuration parameters for the Compaction service is confusing.

In HoodieCompactionConfig:
* hoodie.compact.inline
* hoodie.compact.schedule.inline
* hoodie.log.compaction.enable
* hoodie.log.compaction.inline
* hoodie.compact.inline.max.delta.commits
* hoodie.compact.inline.max.delta.seconds
* hoodie.compact.inline.trigger.strategy
* hoodie.parquet.small.file.limit
* hoodie.record.size.estimation.threshold
* hoodie.compaction.target.io
* hoodie.compaction.logfile.size.threshold
* hoodie.compaction.logfile.num.threshold
* hoodie.compaction.strategy
* hoodie.compaction.daybased.target.partitions
* hoodie.copyonwrite.insert.split.size
* hoodie.copyonwrite.insert.auto.split
* hoodie.copyonwrite.record.size.estimate
* hoodie.log.compaction.blocks.threshold

In FlinkOptions:
* compaction.async.enabled
* compaction.schedule.enabled
* compaction.delta_commits
* compaction.delta_seconds
* compaction.trigger.strategy
* compaction.target_io
* compaction.max_memory
* compaction.tasks
* compaction.timeout.seconds

Need to refactor naming while preserving backward compatibility.
   Priority: Minor  (was: Major)

> Consistent naming in Compaction service
> ---
>
> Key: HUDI-7646
> URL: https://issues.apache.org/jira/browse/HUDI-7646
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Geser Dugarov
>Priority: Minor
>
> The set of configuration parameters for the Compaction service is confusing.
> In HoodieCompactionConfig:
> * hoodie.compact.inline
> * hoodie.compact.schedule.inline
> * hoodie.log.compaction.enable
> * hoodie.log.compaction.inline
> * hoodie.compact.inline.max.delta.commits
> * hoodie.compact.inline.max.delta.seconds
> * hoodie.compact.inline.trigger.strategy
> * hoodie.parquet.small.file.limit
> * hoodie.record.size.estimation.threshold
> * hoodie.compaction.target.io
> * hoodie.compaction.logfile.size.threshold
> * hoodie.compaction.logfile.num.threshold
> * hoodie.compaction.strategy
> * hoodie.compaction.daybased.target.partitions
> * hoodie.copyonwrite.insert.split.size
> * hoodie.copyonwrite.insert.auto.split
> * hoodie.copyonwrite.record.size.estimate
> * hoodie.log.compaction.blocks.threshold
> In FlinkOptions:
> * compaction.async.enabled
> * compaction.schedule.enabled
> * compaction.delta_commits
> * compaction.delta_seconds
> * compaction.trigger.strategy
> * compaction.target_io
> * compaction.max_memory
> * compaction.tasks
> * compaction.timeout.seconds
> Need to refactor naming while preserving backward compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7646) Consistent naming in Compaction service

2024-04-22 Thread Geser Dugarov (Jira)
Geser Dugarov created HUDI-7646:
---

 Summary: Consistent naming in Compaction service
 Key: HUDI-7646
 URL: https://issues.apache.org/jira/browse/HUDI-7646
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Geser Dugarov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2024-03-13 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825954#comment-17825954
 ] 

Geser Dugarov edited comment on HUDI-6438 at 3/13/24 8:03 AM:
--

The first fix, commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master 
branch, has been reverted by commit bc522a6ce4142510f43529798ef4217839d71624.

The reason is that this issue is similar to HUDI-6219, which has been fixed 
properly, without adding new parameters, by commit 
ea547e5681a007e546b8ca8cb1399da0a4cd5012 in the master branch.


was (Author: JIRAUSER301110):
The first fix, commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master 
branch, has been reverted by commit bc522a6ce4142510f43529798ef4217839d71624.

The reason is that this issue is similar to 
[HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], which has been 
fixed properly without adding new parameters.

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]
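
For illustration, a hedged PySpark sketch of the failure class (the exact
reproducer is in the GitHub issue above; the table name and path are
hypothetical): the incoming DataFrame declares its array column and elements
non-nullable, while the corresponding target column is nullable.

{code:python}
# Hedged sketch of the issue class: non-nullable array column in the source,
# nullable column on the table side.
from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("hudi-nonnull-array").getOrCreate()

# Array column whose elements and the column itself are non-nullable.
src_schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("tags", ArrayType(StringType(), containsNull=False),
                nullable=False),
])
src = spark.createDataFrame([(1, ["x", "y"])], src_schema)

# Writing into a table whose "tags" column is nullable: before the fix, the
# nullability mismatch could fail the insert instead of being widened.
(src.write.format("org.apache.hudi")
    .option("hoodie.table.name", "nonnull_array_demo")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .mode("append")
    .save("/tmp/nonnull_array_demo"))
{code}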



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns

2024-03-13 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825954#comment-17825954
 ] 

Geser Dugarov commented on HUDI-6438:
-

The first fix, commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master 
branch, has been reverted by commit bc522a6ce4142510f43529798ef4217839d71624.

The reason is that this issue is similar to 
[HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], which has been 
fixed properly without adding new parameters.

> Fix issue while inserting non-nullable array columns to nullable columns
> 
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> Github issue - [https://github.com/apache/hudi/issues/9042]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema

2024-03-13 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825953#comment-17825953
 ] 

Geser Dugarov commented on HUDI-6219:
-

Fixed in the master branch by commit ea547e5681a007e546b8ca8cb1399da0a4cd5012.

> Ensure consistency between Spark catalog schema and Hudi schema
> ---
>
> Key: HUDI-6219
> URL: https://issues.apache.org/jira/browse/HUDI-6219
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> [HUDI-4149|https://github.com/apache/hudi/pull/5672] fixed the drop-table error 
> when the table directory has been moved, but it makes the Spark catalog table 
> schema inconsistent with the Hudi schema if some column types are not Avro 
> data types.
> *Root cause:*
> The Hudi schema uses Avro types, but the Spark catalog table schema does not. 
> There are two steps to record the schema when creating a Hudi table:
> Step 1: record the Avro-compatible schema in .hoodie/hoodie.properties;
> Step 2: record the table in the Spark catalog.
> Step 2 uses HoodieCatalog.tableSchema, which is currently table.schema, and 
> that causes this issue.
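
For illustration, a hedged sketch of how the divergence could be observed
(table name and column choice are illustrative; Avro has no byte/short types,
so a SMALLINT column is one candidate for such a mismatch):

{code:python}
# Hedged sketch: create a table with a column type that has no direct Avro
# counterpart, then compare the two recorded schemas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-catalog-schema").getOrCreate()

spark.sql("""
    CREATE TABLE catalog_schema_demo (
        id INT,
        small_col SMALLINT  -- no direct Avro counterpart, widened to int
    ) USING hudi
    LOCATION '/tmp/catalog_schema_demo'
""")

# Step 2 schema, as the Spark catalog recorded it:
spark.table("catalog_schema_demo").printSchema()

# Step 1 schema, as Hudi recorded it, lives under
# /tmp/catalog_schema_demo/.hoodie/ (hoodie.properties and commit metadata);
# before the fix the two could disagree on such columns.
{code}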



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7493) Clean configuration for clean service

2024-03-12 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825613#comment-17825613
 ] 

Geser Dugarov edited comment on HUDI-7493 at 3/12/24 12:17 PM:
---

Could be labeled with the "Config Simplification" epic.


was (Author: JIRAUSER301110):
Could be labeled with the ["Config Simplification" 
epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}.
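
For illustration, a minimal sketch of the mixed prefixes as they appear in a
single writer-options map (the keys shown follow Hudi's cleaning configs and
are picked only to make the inconsistency visible; exact names may vary by
version):

{code:python}
# Minimal sketch: cleaning-related options mix the "hoodie.clean.*" and
# "hoodie.cleaner.*" prefixes within one job configuration.
hudi_clean_options = {
    "hoodie.clean.automatic": "true",                # hoodie.clean.*
    "hoodie.clean.async": "false",                   # hoodie.clean.*
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",  # hoodie.cleaner.*
    "hoodie.cleaner.commits.retained": "10",         # hoodie.cleaner.*
}
{code}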



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7493) Clean configuration for clean service

2024-03-12 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825613#comment-17825613
 ] 

Geser Dugarov commented on HUDI-7493:
-

Could be labeled with the ["Config Simplification" 
epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>
> Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

