[jira] [Commented] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864995#comment-17864995 ]

Geser Dugarov commented on HUDI-7938:
-------------------------------------

Raised an issue to discuss the expected behavior in the 1.0-rc2 release: https://github.com/apache/hudi/issues/11616

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
> -----------------------------------------------------------
>
>                 Key: HUDI-7938
>                 URL: https://issues.apache.org/jira/browse/HUDI-7938
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Geser Dugarov
>            Assignee: Geser Dugarov
>            Priority: Major
>
> HUDI-7567 added schema evolution to the file group reader (#10957), but broke integration with PySpark.
> When trying to call
> {quote}
> df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
> the following error is raised:
> {quote}
> 24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
> 	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
> 	at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
> 	at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
> 	at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
> 	at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
> 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
> 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:139)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
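The NullPointerException above occurs while executors rebuild Hudi's broadcast storage configuration without the Kryo registrar in place. A minimal sketch of the session settings an affected PySpark job can set explicitly; the property names come from this ticket and from standard Hudi/Spark setup, while the dict and helper below are purely illustrative, not Hudi API:

```python
# Illustrative only: the Spark properties this ticket says are not applied by
# default. In a real job they would be passed via SparkSession.builder.config().
HUDI_KRYO_CONF = {
    # Serializer used for broadcast variables such as Hudi's storage config.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Registers Hudi's classes with Kryo so executors can rebuild the
    # broadcast Hadoop configuration instead of hitting a NullPointerException.
    "spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar",
}

def with_hudi_kryo(conf: dict) -> dict:
    """Return a copy of `conf` with the Hudi Kryo settings merged in."""
    merged = dict(conf)
    merged.update(HUDI_KRYO_CONF)
    return merged
```

With these settings applied at session creation, the `spark.read.format("org.apache.hudi").load(...)` call from the ticket has the registrator available on executors.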
[jira] [Commented] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864971#comment-17864971 ]

Geser Dugarov commented on HUDI-7709:
-------------------------------------

[~codope] I prepared another fix without any use of nulls. Could you please take a look at the corresponding PR: [https://github.com/apache/hudi/pull/11615] ?

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --------------------------------------------------------------------------
>
>                 Key: HUDI-7709
>                 URL: https://issues.apache.org/jira/browse/HUDI-7709
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core
>            Reporter: Aditya Goenka
>            Assignee: Geser Dugarov
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]
[jira] [Comment Edited] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864959#comment-17864959 ]

Geser Dugarov edited comment on HUDI-7709 at 7/11/24 8:16 AM:
--------------------------------------------------------------

[~codope] If you don't mind, could you please describe how to reproduce the NPE? I couldn't find a suitable test scenario.

was (Author: JIRAUSER301110):
[~codope] If you don't mind, could you please describe how to reproduce the NPE?
[jira] [Commented] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864959#comment-17864959 ]

Geser Dugarov commented on HUDI-7709:
-------------------------------------

[~codope] If you don't mind, could you please describe how to reproduce the NPE?
[jira] [Updated] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7709:
--------------------------------
    Summary: ClassCastException while reading the data using TimestampBasedKeyGenerator  (was: Class Cast Exception while reading the data using TimestampBasedKeyGenerator)
[jira] [Comment Edited] (HUDI-7938) Missed HoodieSparkKryoRegistrar in broadcasted storage config
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862726#comment-17862726 ]

Geser Dugarov edited comment on HUDI-7938 at 7/5/24 8:33 AM:
-------------------------------------------------------------

spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar is missing in the Hadoop configuration.

was (Author: JIRAUSER301110):
spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar is missing in the configuration.
[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7938:
--------------------------------
    Summary: Missed HoodieSparkKryoRegistrar in Hadoop config by default  (was: Missed HoodieSparkKryoRegistrar in broadcasted storage config)
[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in broadcasted storage config
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7938:
--------------------------------
    Summary: Missed HoodieSparkKryoRegistrar in broadcasted storage config  (was: NullPointerException during read from PySpark)
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used in partition column
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Summary: Incorrect partition pruning when TimestampBasedKeyGenerator is used in partition column  (was: Incorrect partition pruning when TimestampBasedKeyGenerator is used)

> Incorrect partition pruning when TimestampBasedKeyGenerator is used in partition column
> ---------------------------------------------------------------------------------------
>
>                 Key: HUDI-7952
>                 URL: https://issues.apache.org/jira/browse/HUDI-7952
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Geser Dugarov
>            Assignee: Geser Dugarov
>            Priority: Major
>
> The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.
>
> But there is another problem with range filters on a partition column. For instance, we have a UNIX_TIMESTAMP in column ts, and the table is also partitioned by ts with hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH". For a query like:
> SELECT ... WHERE ts BETWEEN 1078016000 and 1718953003 ...
> it's not possible to filter rows properly.
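The mismatch the ticket describes can be seen with a small sketch (not Hudi code): TimestampBasedKeyGenerator stores the partition value as a formatted date string, while the BETWEEN predicate compares raw epoch seconds, so the two are not directly comparable. The strftime pattern below is a Python stand-in for the ticket's output dateformat:

```python
from datetime import datetime, timezone

# Python stand-in for hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"
OUTPUT_FORMAT = "%Y-%m-%d %H"

def partition_value(ts_epoch_seconds: int) -> str:
    """What the key generator writes as the partition path segment (UTC)."""
    return datetime.fromtimestamp(ts_epoch_seconds, tz=timezone.utc).strftime(OUTPUT_FORMAT)

# Bounds from the BETWEEN predicate in the ticket's example query.
lo, hi = 1078016000, 1718953003

# The stored partition values are formatted strings such as "2004-02-29 00",
# so a pruner comparing them against the numeric bounds lo/hi cannot decide
# which partitions fall inside the range without converting back to epochs.
print(partition_value(lo), partition_value(hi))
```

Pruning would have to parse the partition string back into a timestamp (or format the bounds) before comparing, which is what the plain range filter fails to do.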
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But there is another problem with range filters on a partition column. For instance, we have a UNIX_TIMESTAMP in column ts, and the table is also partitioned by ts with hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH". For a query like:

SELECT ... WHERE ts BETWEEN 1078016000 and 1718953003 ...

it's not possible to filter rows properly.

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But there is another problem with partition range filters. For instance, for UNIX_TIMESTAMP, column ts, we set:

SELECT ... WHERE ts BETWEEN 1078016000 and 1718953003 ...

And the table is also partitioned by ts with hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But there is another problem with partition range filters. For instance, for UNIX_TIMESTAMP, column ts, we set:

SELECT ... WHERE ts BETWEEN 1078016000 and 1718953003 ...

And the table is also partitioned by ts with hoodie.keygen.timebased.output.dateformat = "yyyy-MM-dd HH"

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But also there is a problem with partition range filters. For instance, for UNIX_TIMESTAMP we set:
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But also there is a problem with partition range filters. For instance, for UNIX_TIMESTAMP we set:

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But also there is a problem with
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

But also there is a problem with

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.
[jira] [Assigned] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov reassigned HUDI-7952:
-----------------------------------
    Assignee: Geser Dugarov
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return a Seq of PartitionDirectory with null values.

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return an empty Seq of PartitionDirectory due to
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-7952:
--------------------------------
    Description:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex.listFiles() would return an empty PartitionDirectory

  was:
The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, which uses nulls as partition column values, could lead to empty query results. HoodieFileIndex
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7952: Description: The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, when nulls are used as partition column values, could lead to empty query results: HoodieFileIndex.listFiles() would return an empty Seq of PartitionDirectory due to was: The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, when nulls are used as partition column values, could lead to empty query results: HoodieFileIndex.listFiles() would return empty PartitionDirectory
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7952: Description: The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709, when nulls are used as partition column values, could lead to empty query results: HoodieFileIndex was: The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709 misses a partition pruning check.
[jira] [Updated] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
[ https://issues.apache.org/jira/browse/HUDI-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7952: Description: The fix for the ClassCastException in https://issues.apache.org/jira/browse/HUDI-7709 misses a partition pruning check.
[jira] [Created] (HUDI-7952) Incorrect partition pruning when TimestampBasedKeyGenerator is used
Geser Dugarov created HUDI-7952: --- Summary: Incorrect partition pruning when TimestampBasedKeyGenerator is used Key: HUDI-7952 URL: https://issues.apache.org/jira/browse/HUDI-7952 Project: Apache Hudi Issue Type: Bug Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862922#comment-17862922 ] Geser Dugarov commented on HUDI-7938: - [~yihua], if you don't mind, could you please clarify what should be done about registration of the Hudi serializer in Spark? > NullPointerException during read from PySpark > --------------------------------------------- > > Key: HUDI-7938 > URL: https://issues.apache.org/jira/browse/HUDI-7938 > Project: Apache Hudi > Issue Type: Bug > Reporter: Geser Dugarov > Assignee: Geser Dugarov > Priority: Major > > HUDI-7567 (Add schema evolution to the filegroup reader, #10957) broke integration with PySpark. > When trying to call > {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path) > df_load.collect() > {quote} > got: > {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException > at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) > at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) > at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) > at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) > at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) > at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) > at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) > at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > Spark 3.4.3 was used. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862921#comment-17862921 ] Geser Dugarov commented on HUDI-7938: - To support running from PySpark without setting spark.kryo.registrator, this PR was landed: [https://github.com/apache/hudi/pull/11355] But after [https://github.com/apache/hudi/pull/10957] was landed, we need to set it again. For now, I don't know whether we should make this configuration mandatory or change the code, so leaving this task as it is for some time.
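The workaround discussed in the comment above amounts to restoring the registrator when the Spark session is created. A minimal PySpark sketch, assuming pyspark and a Hudi Spark bundle jar are on the classpath; the table path is a placeholder, not taken from the issue:

```python
from pyspark.sql import SparkSession

# Sketch of the workaround: register Hudi's Kryo registrar explicitly,
# as the Hudi quick start recommends. Assumes a Hudi Spark bundle jar
# is already on the driver/executor classpath.
spark = (
    SparkSession.builder
    .appName("hudi-npe-workaround")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
    .getOrCreate()
)

tmp_dir_path = "/tmp/hudi_table"  # placeholder path for the repro table

# The read from the repro; with the registrator set, the executor-side
# deserialization no longer produces a null Hadoop Configuration.
df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()
```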
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Status: Open (was: In Progress)
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862726#comment-17862726 ] Geser Dugarov commented on HUDI-7938: - The configuration is missing spark.kryo.registrator = org.apache.spark.HoodieSparkKryoRegistrar.
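The two serializer-related settings Hudi's Spark quick start recommends belong together; the one missing by default here is spark.kryo.registrator. A small sketch rendering them as spark-submit --conf flags:

```python
# Serializer settings from Hudi's Spark quick start; in this issue
# "spark.kryo.registrator" is the one that is not set by default.
hudi_spark_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar",
}

# Render them as spark-submit flags for a PySpark job.
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(hudi_spark_conf.items()))
print(flags)
```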
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Status: In Progress (was: Open)
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862443#comment-17862443 ] Geser Dugarov commented on HUDI-7938: - Also reproduced with Spark 3.5.1.
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Description: HUDI-7567 (Add schema evolution to the filegroup reader, #10957) broke integration with PySpark. When trying to call {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path) df_load.collect() {quote} got the NullPointerException stack trace quoted above. Spark 3.4.3 was used. was: HUDI-7567 (Add schema evolution to the filegroup reader, #10957) broke integration with PySpark. When trying to call the same read, got the same stack trace.
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Description: HUDI-7567 Add schema evolution to the filegroup reader (#10957), but broke integration with PySpark. When trying to call {quote}df_load = spark.read.format({color:#067d17}"org.apache.hudi"{color}).load(tmp_dir_path) df_load.collect(){quote} got: {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.(Configuration.java:842) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} was: HUDI-7567 Add schema evolution to the filegroup reader (#10957), but broke integration with PySpark. 
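The failing call quoted in the description is a plain PySpark datasource read. A minimal reproduction sketch follows, assuming pyspark and the hudi-spark bundle are available and a Hudi table has already been written to `tmp_dir_path`; none of these assumptions, nor the `/tmp/hudi_table` path used below, come from the report itself:

```python
# Reproduction sketch for HUDI-7938 (NullPointerException during read from
# PySpark). pyspark, the hudi-spark bundle on the Spark classpath, and an
# already-written Hudi table at the given path are assumptions.

def read_hudi_table(spark, tmp_dir_path):
    """Run the datasource read whose executor tasks raised the NPE."""
    # The report uses the fully qualified datasource name "org.apache.hudi".
    df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
    # collect() forces the FileScanRDD tasks on the executors, where
    # HadoopStorageConfiguration.unwrapCopy() hit the null Hadoop config.
    return df_load.collect()

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        SparkSession = None  # pyspark not installed; treat as a sketch only
    if SparkSession is not None:
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        rows = read_hudi_table(spark, "/tmp/hudi_table")  # hypothetical path
```

With a driver-side read the NPE does not appear; per the trace it is raised only in the executor tasks spawned by `collect()`, which is why the failure surfaces as a lost task rather than an immediate exception.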
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Description: HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke integration with PySpark. Got: {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} was: HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke integration with PySpark. 
[jira] [Assigned] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7938: --- Assignee: Geser Dugarov > NullPointerException during read from PySpark > - > > Key: HUDI-7938 > URL: https://issues.apache.org/jira/browse/HUDI-7938 > Project: Apache Hudi > Issue Type: Bug >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > > HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke > integration with PySpark. > Got: > > {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 > (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException > at org.apache.hadoop.conf.Configuration.(Configuration.java:842) > at > org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) > at > org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) > at > org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Description: [HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke integration with PySpark. Got: ``` 24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.(Configuration.java:842) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` > NullPointerException during read from PySpark > - > > Key: HUDI-7938 > URL: https://issues.apache.org/jira/browse/HUDI-7938 > Project: Apache Hudi > Issue Type: Bug >Reporter: Geser Dugarov >Priority: Major > > [HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke > integration with PySpark. 
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Description: HUDI-7567 Add schema evolution to the filegroup reader (#10957) broke integration with PySpark. Got: {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.(Configuration.java:842) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} was: [HUDI-7567] Add schema evolution to the filegroup reader (#10957) broke integration with PySpark. 
[jira] [Created] (HUDI-7938) NullPointerException during read from PySpark
Geser Dugarov created HUDI-7938: --- Summary: NullPointerException during read from PySpark Key: HUDI-7938 URL: https://issues.apache.org/jira/browse/HUDI-7938 Project: Apache Hudi Issue Type: Bug Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-6438. --- Resolution: Fixed Fixed in HUDI-6219 > Fix issue while inserting non-nullable array columns to nullable columns > > > Key: HUDI-6438 > URL: https://issues.apache.org/jira/browse/HUDI-6438 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Github issue - [https://github.com/apache/hudi/issues/9042] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-6438: --- Assignee: Geser Dugarov > Fix issue while inserting non-nullable array columns to nullable columns > > > Key: HUDI-6438 > URL: https://issues.apache.org/jira/browse/HUDI-6438 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Github issue - [https://github.com/apache/hudi/issues/9042] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema
[ https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-6219. --- Resolution: Fixed > Ensure consistency between Spark catalog schema and Hudi schema > --- > > Key: HUDI-6219 > URL: https://issues.apache.org/jira/browse/HUDI-6219 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Priority: Major > Labels: pull-request-available > > [HUDI-4149|https://github.com/apache/hudi/pull/5672] fixed the drop table error > when the table directory was moved, but it makes the Spark catalog table schema > inconsistent with the Hudi schema if some column types are not Avro data types. > *Root cause:* > The Hudi schema uses Avro types, but the Spark catalog table schema does not. There are > two steps to record the schema when creating a Hudi table: > Step 1: record the Avro-compatible schema to .hoodie/hoodie.properties, > Step 2: record the table in the Spark catalog. > Step 2 uses HoodieCatalog.tableSchema, which is currently table.schema, and this > causes the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-6438: Fix Version/s: 0.14.0 (was: 1.1.0) > Fix issue while inserting non-nullable array columns to nullable columns > > > Key: HUDI-6438 > URL: https://issues.apache.org/jira/browse/HUDI-6438 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > > Github issue - [https://github.com/apache/hudi/issues/9042] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema
[ https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-6219: Fix Version/s: 0.14.0 > Ensure consistency between Spark catalog schema and Hudi schema > --- > > Key: HUDI-6219 > URL: https://issues.apache.org/jira/browse/HUDI-6219 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > [HUDI-4149|https://github.com/apache/hudi/pull/5672] fixed the drop table error > when the table directory was moved, but it makes the Spark catalog table schema > inconsistent with the Hudi schema if some column types are not Avro data types. > *Root cause:* > The Hudi schema uses Avro types, but the Spark catalog table schema does not. There are > two steps to record the schema when creating a Hudi table: > Step 1: record the Avro-compatible schema to .hoodie/hoodie.properties, > Step 2: record the table in the Spark catalog. > Step 2 uses HoodieCatalog.tableSchema, which is currently table.schema, and this > causes the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7493: --- Assignee: Geser Dugarov (was: Lin Liu) > Clean configuration for clean service > - > > Key: HUDI-7493 > URL: https://issues.apache.org/jira/browse/HUDI-7493 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, configs, table-service >Reporter: Lin Liu >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0 > > > Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7487) Investigate flaky test in MERGE INTO
[ https://issues.apache.org/jira/browse/HUDI-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7487: --- Assignee: Geser Dugarov > Investigate flaky test in MERGE INTO > > > Key: HUDI-7487 > URL: https://issues.apache.org/jira/browse/HUDI-7487 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > No production code changes, but this test started to fail: > {code:java} > - Test MERGE INTO with inserts only on MOR table when partial updates are > enabled *** FAILED *** > Expected Array([1,a1,10.0,1000,a1: desc1], [2,a2,20.0,1200,a2: desc2], > [3,a3,30.0,1250,a3: desc3], [4,a4,60.0,1270,a4: desc4]), but got > Array([1,a1,10.0,1000,a1: desc1], [2,a2,20.0,1200,a2: desc2], > [3,a3,30.0,1250,a3: desc3]) (HoodieSparkSqlTestBase.scala:109) > 1564068 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN > org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data > file written for commit, so could not get schema for table > file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h9 > 1564072 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN > org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data > file written for commit, so could not get schema for table > file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h9 > 1564094 [ScalaTest-main-running-TestPartialUpdateForMergeInto] WARN > org.apache.hudi.common.table.TableSchemaResolver [] - Could not find any data > file written for commit, so could not get schema for table > file:/tmp/spark-037c0206-b70d-47ee-9f85-3b6fc12bf1a5/h10 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema
[ https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860402#comment-17860402 ] Geser Dugarov edited comment on HUDI-6947 at 6/27/24 9:51 AM: -- Fixed in the master branch, 2e39bfb694099293b77eec9977e5e46af97af18b was (Author: JIRAUSER301110): Fixed in the master branch, cddd7d416a5db31de879790a80a33bb86cf02cbc > Clean up HoodieSparkSqlWriter.deduceWriterSchema > > > Key: HUDI-6947 > URL: https://issues.apache.org/jira/browse/HUDI-6947 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, configs, spark, spark-sql >Reporter: Jonathan Vexler >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > too many flags here: > ADD_NULL_FOR_DELETED_COLUMNS > RECONCILE_SCHEMA > AVRO_SCHEMA_VALIDATE_ENABLE -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema
[ https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860402#comment-17860402 ] Geser Dugarov commented on HUDI-6947: - Fixed in the master branch, cddd7d416a5db31de879790a80a33bb86cf02cbc > Clean up HoodieSparkSqlWriter.deduceWriterSchema > > > Key: HUDI-6947 > URL: https://issues.apache.org/jira/browse/HUDI-6947 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, configs, spark, spark-sql >Reporter: Jonathan Vexler >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > too many flags here: > ADD_NULL_FOR_DELETED_COLUMNS > RECONCILE_SCHEMA > AVRO_SCHEMA_VALIDATE_ENABLE -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6947) Clean up HoodieSparkSqlWriter.deduceWriterSchema
[ https://issues.apache.org/jira/browse/HUDI-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-6947: --- Assignee: Geser Dugarov > Clean up HoodieSparkSqlWriter.deduceWriterSchema > > > Key: HUDI-6947 > URL: https://issues.apache.org/jira/browse/HUDI-6947 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, configs, spark, spark-sql >Reporter: Jonathan Vexler >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > too many flags here: > ADD_NULL_FOR_DELETED_COLUMNS > RECONCILE_SCHEMA > AVRO_SCHEMA_VALIDATE_ENABLE -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7275) org.apache.hudi.TestHoodieSparkSqlWriter#testInsertDatasetWithTimelineTimezoneUTC causes issues with following tests
[ https://issues.apache.org/jira/browse/HUDI-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7275: --- Assignee: Geser Dugarov > org.apache.hudi.TestHoodieSparkSqlWriter#testInsertDatasetWithTimelineTimezoneUTC > causes issues with following tests > > > Key: HUDI-7275 > URL: https://issues.apache.org/jira/browse/HUDI-7275 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > When the next test runs, it gets stuck in an infinite loop and the output is > {code:java} > 60331 [main] INFO org.apache.hudi.common.table.timeline.TimeGeneratorBase [] > - Released the connection of the timeGenerator lock > 60331 [main] INFO org.apache.hudi.common.table.timeline.TimeGeneratorBase [] > - LockProvider for TimeGenerator: > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 60331 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, > Read locks = 0], Thread main, In-process lock state ACQUIRING > 60331 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, > Read locks = 0], Thread main, In-process lock state ACQUIRED > 60333 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, > Read locks = 0], Thread main, In-process lock state RELEASING > 60333 [main] 
INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, > Read locks = 0], Thread main, In-process lock state RELEASED > 60333 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, > Read locks = 0], Thread main, In-process lock state ALREADY_RELEASED > 60333 [main] INFO org.apache.hudi.common.table.timeline.TimeGeneratorBase [] > - Released the connection of the timeGenerator lock > 60333 [main] INFO org.apache.hudi.common.table.timeline.TimeGeneratorBase [] > - LockProvider for TimeGenerator: > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 60333 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, > Read locks = 0], Thread main, In-process lock state ACQUIRING > 60333 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, > Read locks = 0], Thread main, In-process lock state ACQUIRED > 60334 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 1, > Read locks = 0], Thread main, In-process 
lock state RELEASING > 60334 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@5d045508[Write locks = 0, > Read locks = 0], Thread main, In-process lock state RELEASED > 60334 [main] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > /var/folders/d0/l7mfhzl1661byhh3mbyg5fv0gn/T/hoodie_test_path7599985521109702031_1, > Lock Instance >
[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7646: Status: Open (was: In Progress) > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Fix For: 1.0.0 > > > The set of configuration parameters for the Compaction service is confusing. > In HoodieCompactionConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor the naming while preserving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
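The naming mismatch between the two lists above can be sketched as a translation table. The pairings below are assumptions drawn from the apparent meaning of the option names (e.g. `hoodie.compact.inline.max.delta.commits` vs `compaction.delta_commits`), not an official alias table:

```python
# Hypothetical mapping between Spark-side HoodieCompactionConfig keys and
# their apparent Flink-side counterparts from FlinkOptions. The pairings are
# illustrative assumptions based on the option names, not documented aliases.
SPARK_TO_FLINK = {
    "hoodie.compact.inline.max.delta.commits": "compaction.delta_commits",
    "hoodie.compact.inline.max.delta.seconds": "compaction.delta_seconds",
    "hoodie.compact.inline.trigger.strategy": "compaction.trigger.strategy",
    "hoodie.compaction.target.io": "compaction.target_io",
}

def flink_key(spark_key: str) -> str:
    """Translate a Spark-side compaction key to its assumed Flink counterpart."""
    return SPARK_TO_FLINK[spark_key]
```

A consistent scheme would make such a table unnecessary; today each engine has to carry its own prefix and separator conventions (`hoodie.compact.*` with dots vs `compaction.*` with underscores).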
[jira] [Updated] (HUDI-7850) Makes hoodie.record.merge.mode mandatory upon creating the table and first write
[ https://issues.apache.org/jira/browse/HUDI-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7850: Status: In Progress (was: Open) > Makes hoodie.record.merge.mode mandatory upon creating the table and first > write > > > Key: HUDI-7850 > URL: https://issues.apache.org/jira/browse/HUDI-7850 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Fix For: 1.0.0 > > > Right now, "hoodie.record.merge.mode" is optional during writes as it is > inferred from the payload class name, payload type, and the record merger > strategy during the creation of the table properties. We should make this > config mandatory in release 1.0 and make other merge configs optional to > simplify the configuration experience. -- This message was sent by Atlassian Jira (v8.20.10#820010)
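The proposed validation described above can be sketched as a pre-write check. The mode names below follow Hudi 1.0's record merge mode terminology, but the check itself is a hypothetical illustration, not the actual implementation:

```python
# Sketch of the proposed rule: "hoodie.record.merge.mode" becomes mandatory
# on table creation / first write instead of being inferred from the payload
# class, payload type, and record merger strategy. The validator below is an
# illustrative assumption, not Hudi code.
ALLOWED_MERGE_MODES = {"EVENT_TIME_ORDERING", "COMMIT_TIME_ORDERING", "CUSTOM"}

def validate_write_options(options: dict) -> dict:
    """Reject a first write that does not set the merge mode explicitly."""
    mode = options.get("hoodie.record.merge.mode")
    if mode is None:
        raise ValueError("hoodie.record.merge.mode is mandatory upon first write")
    if mode not in ALLOWED_MERGE_MODES:
        raise ValueError(f"unknown record merge mode: {mode}")
    return options
```

With the mode explicit, the other merge configs (payload class, merger strategy) can become optional, which is the simplification the issue aims for.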
[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`
[ https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7925: Description: There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`. Therefore, when reading data with "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we could get a ClassCastException during extraction; for instance, see HUDI-7709. Need to implement logic similar to `HoodieBaseRelation`. was: There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`. Therefore during reading of data with "hoodie.file.group.reader.enabled" = "true", which is default behavior, we could got ClassCastException during extracting., for instance, see . Need to implement logic similar to `HoodieBaseRelation`. > Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory` > -- > > Key: HUDI-7925 > URL: https://issues.apache.org/jira/browse/HUDI-7925 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Major > > There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory`. Therefore, when reading data with > "hoodie.file.group.reader.enabled" = "true", which is the default behavior, we > could get a ClassCastException during extraction; for instance, see HUDI-7709. > Need to implement logic similar to `HoodieBaseRelation`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
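The decision that is missing in `HoodieHadoopFsRelationFactory` can be approximated as a predicate. The conditions below are a paraphrased assumption modeled loosely on `HoodieBaseRelation`, not the literal implementation:

```python
# Hypothetical sketch of when partition values should be read back from the
# partition path instead of from the data files. The conditions are assumed
# for illustration; the real HoodieBaseRelation logic is more involved.
def should_extract_partition_values_from_path(
    extract_from_path_enabled: bool,  # hoodie.datasource.read.extract.partition.values.from.path
    drop_partition_columns: bool,     # partition columns are absent from the data files
    uses_url_encoding: bool,          # partition paths were URL-encoded on write
) -> bool:
    # If partition columns were dropped from the files, the path is the only
    # source of their values; URL-encoded paths cannot be decoded reliably here.
    return (extract_from_path_enabled or drop_partition_columns) and not uses_url_encoding
```

Without any such predicate, the new file group reader falls back to casting whatever the path parser produces, which is where the ClassCastException reported in HUDI-7709 comes from.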
[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`
[ https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7925: Description: There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`. Therefore during reading of data with "hoodie.file.group.reader.enabled" = "true", which is default behavior, we could got ClassCastException during extracting., for instance, see . Need to implement logic similar to `HoodieBaseRelation`. was: There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`. Therefore during reading of data with "hoodie.file.group.reader.enabled" = "true", which is default behavior, we got null values. Need to implement logic similar to `HoodieBaseRelation`. > Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory` > -- > > Key: HUDI-7925 > URL: https://issues.apache.org/jira/browse/HUDI-7925 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Major > > There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory`. Therefore during reading of data with > "hoodie.file.group.reader.enabled" = "true", which is default behavior, we > could got ClassCastException during extracting., for instance, see . > Need to implement logic similar to `HoodieBaseRelation`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7925) Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`
[ https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7925: Summary: Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory` (was: Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`) > Implement logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory` > -- > > Key: HUDI-7925 > URL: https://issues.apache.org/jira/browse/HUDI-7925 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Major > > There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory`. Therefore during reading of data with > "hoodie.file.group.reader.enabled" = "true", which is default behavior, we > got null values. > Need to implement logic similar to `HoodieBaseRelation`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`
[ https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7925: Description: There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in `HoodieHadoopFsRelationFactory`. Therefore during reading of data with "hoodie.file.group.reader.enabled" = "true", which is default behavior, we got null values. Need to implement logic similar to `HoodieBaseRelation`. was:`shouldExtractPartitionValuesFromPartitionPath` is not used in > Do not extract values from partition paths in `HoodieHadoopFsRelationFactory` > - > > Key: HUDI-7925 > URL: https://issues.apache.org/jira/browse/HUDI-7925 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Major > > There is no logic for `shouldExtractPartitionValuesFromPartitionPath` in > `HoodieHadoopFsRelationFactory`. Therefore during reading of data with > "hoodie.file.group.reader.enabled" = "true", which is default behavior, we > got null values. > Need to implement logic similar to `HoodieBaseRelation`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`
[ https://issues.apache.org/jira/browse/HUDI-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7925: Description: `shouldExtractPartitionValuesFromPartitionPath` is not used in > Do not extract values from partition paths in `HoodieHadoopFsRelationFactory` > - > > Key: HUDI-7925 > URL: https://issues.apache.org/jira/browse/HUDI-7925 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Major > > `shouldExtractPartitionValuesFromPartitionPath` is not used in -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7925) Do not extract values from partition paths in `HoodieHadoopFsRelationFactory`
Geser Dugarov created HUDI-7925: --- Summary: Do not extract values from partition paths in `HoodieHadoopFsRelationFactory` Key: HUDI-7925 URL: https://issues.apache.org/jira/browse/HUDI-7925 Project: Apache Hudi Issue Type: Improvement Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7033) Fix read error for schema evolution + partition value extraction
[ https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859598#comment-17859598 ] Geser Dugarov edited comment on HUDI-7033 at 6/24/24 7:50 AM: -- Merged a4fa3451916de11dc082792076b62013586dadaf in linked MR 9994 refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889] was (Author: JIRAUSER301110): Merged a4fa3451916de11dc082792076b62013586dadaf refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889] > Fix read error for schema evolution + partition value extraction > > > Key: HUDI-7033 > URL: https://issues.apache.org/jira/browse/HUDI-7033 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Priority: Major > Labels: pull-request-available > > After HUDI-6960 is merged, there > *shouldExtractPartitionValuesFromPartitionPath* will correctly ignore > partition columns in requiredSchema. > > When using the configs below, there will be read errors. > > {code:java} > hoodie.datasource.read.extract.partition.values.from.path = true {code} > > > When the config above is added together with: > > {code:java} > hoodie.schema.on.read.enable = true {code} > > The query schema will be pruned to **{*}NOT{*}** contain any partition > columns. > > When rebuilding parquet filters, file schema's columns are scanned against > querySchema. However, Hudi files (file schema) might still contain partition > columns. And when partition filters are being rebuilt with these file schema > against query schema, it will lead to partition columns not being found. > > {code:java} > Caused by: java.lang.IllegalArgumentException: cannot found filter col > name:region from querySchema: table { > 5: id: optional int > 6: name: optional string > 7: ts: optional long > } > at > org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
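The failure mode quoted above (filter rebuilt against a pruned query schema) can be reproduced in miniature. The function below is an illustrative stand-in for `InternalSchemaUtils.reBuildFilterName`, with schemas reduced to plain column-name lists:

```python
# Minimal sketch of the reBuildFilterName failure from HUDI-7033: the query
# schema has been pruned of partition columns, but a filter built from the
# file schema still references one. The function name and schema shape are
# simplified stand-ins for the real Hudi internals.
def rebuild_filter_name(filter_col, query_schema):
    if filter_col not in query_schema:
        # Mirrors the IllegalArgumentException in the quoted stack trace.
        raise ValueError(
            f"cannot find filter col name:{filter_col} from querySchema: {query_schema}"
        )
    return filter_col

# Query schema pruned by schema-on-read: partition column "region" is gone.
pruned_query_schema = ["id", "name", "ts"]
```

Filtering on a data column succeeds, while filtering on the pruned partition column `region` raises, matching the quoted `cannot found filter col name:region` error.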
[jira] [Reopened] (HUDI-7033) Fix read error for schema evolution + partition value extraction
[ https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reopened HUDI-7033: - Merged a4fa3451916de11dc082792076b62013586dadaf refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889] > Fix read error for schema evolution + partition value extraction > > > Key: HUDI-7033 > URL: https://issues.apache.org/jira/browse/HUDI-7033 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Priority: Major > Labels: pull-request-available > > After HUDI-6960 is merged, there > *shouldExtractPartitionValuesFromPartitionPath* will correctly ignore > partition columns in requiredSchema. > > When using the configs below, there will be read errors. > > {code:java} > hoodie.datasource.read.extract.partition.values.from.path = true {code} > > > When the config above is added together with: > > {code:java} > hoodie.schema.on.read.enable = true {code} > > The query schema will be pruned to **{*}NOT{*}** contain any partition > columns. > > When rebuilding parquet filters, file schema's columns are scanned against > querySchema. However, Hudi files (file schema) might still contain partition > columns. And when partition filters are being rebuilt with these file schema > against query schema, it will lead to partition columns not being found. > > {code:java} > Caused by: java.lang.IllegalArgumentException: cannot found filter col > name:region from querySchema: table { > 5: id: optional int > 6: name: optional string > 7: ts: optional long > } > at > org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7033) Fix read error for schema evolution + partition value extraction
[ https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859598#comment-17859598 ] Geser Dugarov edited comment on HUDI-7033 at 6/24/24 7:47 AM: -- Merged a4fa3451916de11dc082792076b62013586dadaf refers to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889] was (Author: JIRAUSER301110): Merged a4fa3451916de11dc082792076b62013586dadaf refer to [non-merged MR 9889|https://github.com/apache/hudi/pull/9889] > Fix read error for schema evolution + partition value extraction > > > Key: HUDI-7033 > URL: https://issues.apache.org/jira/browse/HUDI-7033 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Priority: Major > Labels: pull-request-available > > After HUDI-6960 is merged, there > *shouldExtractPartitionValuesFromPartitionPath* will correctly ignore > partition columns in requiredSchema. > > When using the configs below, there will be read errors. > > {code:java} > hoodie.datasource.read.extract.partition.values.from.path = true {code} > > > When the config above is added together with: > > {code:java} > hoodie.schema.on.read.enable = true {code} > > The query schema will be pruned to **{*}NOT{*}** contain any partition > columns. > > When rebuilding parquet filters, file schema's columns are scanned against > querySchema. However, Hudi files (file schema) might still contain partition > columns. And when partition filters are being rebuilt with these file schema > against query schema, it will lead to partition columns not being found. > > {code:java} > Caused by: java.lang.IllegalArgumentException: cannot found filter col > name:region from querySchema: table { > 5: id: optional int > 6: name: optional string > 7: ts: optional long > } > at > org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] (HUDI-7033) Fix read error for schema evolution + partition value extraction
[ https://issues.apache.org/jira/browse/HUDI-7033 ] Geser Dugarov deleted comment on HUDI-7033: - was (Author: JIRAUSER301110): Fixed in master, a4fa3451916de11dc082792076b62013586dadaf > Fix read error for schema evolution + partition value extraction > > > Key: HUDI-7033 > URL: https://issues.apache.org/jira/browse/HUDI-7033 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Priority: Major > Labels: pull-request-available > > After HUDI-6960 is merged, there > *shouldExtractPartitionValuesFromPartitionPath* will correctly ignore > partition columns in requiredSchema. > > When using the configs below, there will be read errors. > > {code:java} > hoodie.datasource.read.extract.partition.values.from.path = true {code} > > > When the config above is added together with: > > {code:java} > hoodie.schema.on.read.enable = true {code} > > The query schema will be pruned to **{*}NOT{*}** contain any partition > columns. > > When rebuilding parquet filters, file schema's columns are scanned against > querySchema. However, Hudi files (file schema) might still contain partition > columns. And when partition filters are being rebuilt with these file schema > against query schema, it will lead to partition columns not being found. > > {code:java} > Caused by: java.lang.IllegalArgumentException: cannot found filter col > name:region from querySchema: table { > 5: id: optional int > 6: name: optional string > 7: ts: optional long > } > at > org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7033) Fix read error for schema evolution + partition value extraction
[ https://issues.apache.org/jira/browse/HUDI-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-7033. --- Resolution: Fixed Fixed in master, a4fa3451916de11dc082792076b62013586dadaf > Fix read error for schema evolution + partition value extraction > > > Key: HUDI-7033 > URL: https://issues.apache.org/jira/browse/HUDI-7033 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Priority: Major > Labels: pull-request-available > > After HUDI-6960 is merged, there > *shouldExtractPartitionValuesFromPartitionPath* will correctly ignore > partition columns in requiredSchema. > > When using the configs below, there will be read errors. > > {code:java} > hoodie.datasource.read.extract.partition.values.from.path = true {code} > > > When the config above is added together with: > > {code:java} > hoodie.schema.on.read.enable = true {code} > > The query schema will be pruned to **{*}NOT{*}** contain any partition > columns. > > When rebuilding parquet filters, file schema's columns are scanned against > querySchema. However, Hudi files (file schema) might still contain partition > columns. And when partition filters are being rebuilt with these file schema > against query schema, it will lead to partition columns not being found. > > {code:java} > Caused by: java.lang.IllegalArgumentException: cannot found filter col > name:region from querySchema: table { > 5: id: optional int > 6: name: optional string > 7: ts: optional long > } > at > org.apache.hudi.internal.schema.utils.InternalSchemaUtils.reBuildFilterName(InternalSchemaUtils.java:180) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6286) Overwrite mode should not delete old data
[ https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855820#comment-17855820 ] Geser Dugarov commented on HUDI-6286: - Note that in HoodieWriteUtils.validateTableConfig() we skip all conflict checks between the new and the existing table configuration when the save mode is Overwrite. > Overwrite mode should not delete old data > - > > Key: HUDI-6286 > URL: https://issues.apache.org/jira/browse/HUDI-6286 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Reporter: Hui An >Assignee: Hui An >Priority: Major > Fix For: 1.1.0 > > > https://github.com/apache/hudi/pull/8076/files#r1127283648 > For *Overwrite* mode, we should not delete the basePath. Just overwrite the > existing data -- This message was sent by Atlassian Jira (v8.20.10#820010)
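The two Overwrite semantics under discussion can be contrasted with a toy model of a table as immutable config plus partitioned data. The representation is a hypothetical simplification for illustration only:

```python
# Toy model of the two Overwrite behaviors from HUDI-6286. A "table" is a
# dict with table config and partitioned data; the shape is an illustrative
# assumption, not Hudi's on-disk layout.

def overwrite_by_deleting_base_path(table: dict, new_data: dict) -> dict:
    """Current (undesired) behavior: the whole basePath is deleted,
    so table config is lost along with the old data."""
    return {"config": {}, "partitions": dict(new_data)}

def overwrite_keeping_base_path(table: dict, new_data: dict) -> dict:
    """Desired behavior: replace the data, keep the table config intact."""
    return {"config": dict(table["config"]), "partitions": dict(new_data)}

sample_table = {
    "config": {"hoodie.table.name": "t1"},
    "partitions": {"2018/09/23": ["row_a"], "2018/09/24": ["row_b"]},
}
```

This is also why the comment above matters: since validateTableConfig() skips conflict checks for Overwrite, nothing stops a new write from silently replacing the table's configuration together with its data.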
[jira] [Commented] (HUDI-7847) Infer record merge mode during table upgrade
[ https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853633#comment-17853633 ] Geser Dugarov commented on HUDI-7847: - Thanks for mentioning. I will reuse it. > Infer record merge mode during table upgrade > > > Key: HUDI-7847 > URL: https://issues.apache.org/jira/browse/HUDI-7847 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Fix For: 1.0.0 > > > Record merge mode is required to dictate the merging behavior in release 1.x, > playing the same role as the payload class config in the release 0.x. During > table upgrade, we need to infer the record merge mode based on the payload > class so it's correctly set. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7847) Infer record merge mode during table upgrade
[ https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7847: --- Assignee: Geser Dugarov > Infer record merge mode during table upgrade > > > Key: HUDI-7847 > URL: https://issues.apache.org/jira/browse/HUDI-7847 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Fix For: 1.0.0 > > > Record merge mode is required to dictate the merging behavior in release 1.x, > playing the same role as the payload class config in the release 0.x. During > table upgrade, we need to infer the record merge mode based on the payload > class so it's correctly set. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7850) Makes hoodie.record.merge.mode mandatory upon creating the table and first write
[ https://issues.apache.org/jira/browse/HUDI-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7850: --- Assignee: Geser Dugarov > Makes hoodie.record.merge.mode mandatory upon creating the table and first > write > > > Key: HUDI-7850 > URL: https://issues.apache.org/jira/browse/HUDI-7850 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Fix For: 1.0.0 > > > Right now, "hoodie.record.merge.mode" is optional during writes as it is > inferred from the payload class name, payload type, and the record merger > strategy during the creation of the table properties. We should make this > config mandatory in release 1.0 and make other merge configs optional to > simplify the configuration experience. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7827) Bump io.airlift:aircompressor from 0.25 to 0.27
[ https://issues.apache.org/jira/browse/HUDI-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-7827. --- Resolution: Fixed Fixed in master, d0c7de050a8900a29f5d127093b378b96f9c5158 > Bump io.airlift:aircompressor from 0.25 to 0.27 > --- > > Key: HUDI-7827 > URL: https://issues.apache.org/jira/browse/HUDI-7827 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-3204: Description: {color:#172b4d}Currently, b/c Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` w/ an original timestamp-based column makes it impossible to retrieve the original value (reading from Spark) even though it's persisted in the data file as well.{color} {code:java} import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.config.HoodieWriteConfig import org.apache.hudi.keygen.constant.KeyGeneratorOptions._ import org.apache.hudi.hive.MultiPartKeysValueExtractor val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date") // mor df.write.format("hudi"). option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor"). option("hoodie.datasource.write.table.type", "MERGE_ON_READ"). option("hoodie.datasource.write.recordkey.field", "id"). option("hoodie.datasource.write.partitionpath.field", "data_date"). option("hoodie.datasource.write.precombine.field", "ts"). option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator"). option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING"). option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd"). option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00"). option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd"). mode(org.apache.spark.sql.SaveMode.Append). 
save("file:///tmp/hudi/issue_4417_mor") +---++--+--++---++---+---+--+ |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|age| ts| data_date| +---++--+--++---++---+---+--+ | 20220110172709324|20220110172709324...| 2| 2018/09/24|703e56d3-badb-40b...| 2| z3| 35| v1|2018-09-24| | 20220110172709324|20220110172709324...| 1| 2018/09/23|58fde2b3-db0e-464...| 1| z3| 30| v1|2018-09-23| +---++--+--++---++---+---+--+ // can not query any data spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'") // still can not query any data spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show // cow df.write.format("hudi"). option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow"). option("hoodie.datasource.write.table.type", "COPY_ON_WRITE"). option("hoodie.datasource.write.recordkey.field", "id"). option("hoodie.datasource.write.partitionpath.field", "data_date"). option("hoodie.datasource.write.precombine.field", "ts"). option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator"). option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING"). option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd"). option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00"). option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd"). mode(org.apache.spark.sql.SaveMode.Append). 
save("file:///tmp/hudi/issue_4417_cow") +---++--+--++---++---+---+--+ |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|age| ts| data_date| +---++--+--++---++---+---+--+ | 20220110172721896|20220110172721896...| 2| 2018/09/24|81cc7819-a0d1-4e6...| 2| z3| 35| v1|2018/09/24| | 20220110172721896|20220110172721896...| 1| 2018/09/23|d428019b-a829-41a...| 1| z3| 30| v1|2018/09/23| +---++--+--++---++---+---+--+ // can not query any data spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show // but 2018/09/24 works spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show {code} was: {color:#172b4d}Currently, b/c Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` w/ original
[jira] [Comment Edited] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040 ] Geser Dugarov edited comment on HUDI-7709 at 5/21/24 4:23 AM: -- The issue is related to HUDI-3204. Spark by default retrieves values for the partitioning column from partition paths. We couldn't do it for TimestampBasedKeyGenerator due to data lost after user-defined transformations in "hoodie.keygen.timebased.output.dateformat". Looking for a proper fix. was (Author: JIRAUSER301110): The issue related to HUDI-3204. Spark by default retrieves values for partitioning column from partition paths. We couldn't do it for TimestampBasedKeyGenerator due to lost data after user defined transformations in "hoodie.keygen.timebased.output.dateformat". Looking for proper fixing. > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
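The data-loss mechanism described in this comment can be shown in a few lines: the key generator writes the *transformed* value into the partition path, so the reader sees a string in the output format, not the original column value. The formats mirror the HUDI-3204 example (input `yyyy-MM-dd`, output `yyyy/MM/dd`), translated to Python's strftime codes:

```python
# Why extracting partition values from the path fails for
# TimestampBasedKeyGenerator: the path stores the value after a user-defined
# output-format transformation, so the original representation (and for lossy
# formats, the value itself) is gone. Formats mirror the HUDI-3204 example.
from datetime import datetime

def to_partition_path(value: str,
                      in_fmt: str = "%Y-%m-%d",    # input.dateformat  yyyy-MM-dd
                      out_fmt: str = "%Y/%m/%d"    # output.dateformat yyyy/MM/dd
                      ) -> str:
    return datetime.strptime(value, in_fmt).strftime(out_fmt)

path_segment = to_partition_path("2018-09-24")
# path_segment is "2018/09/24": a filter on data_date = '2018-09-24' no longer
# matches the path, and casting the path string back to the column's original
# type is what triggers the ClassCastException.
```

With a truly lossy output format such as `yyyy/MM` the original day of month could not be recovered at all, which is why a generic "parse the path back" fix is not available.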
[jira] [Comment Edited] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040 ] Geser Dugarov edited comment on HUDI-7709 at 5/21/24 4:22 AM: -- The issue related to HUDI-3204. Spark by default retrieves values for partitioning column from partition paths. We couldn't do it for TimestampBasedKeyGenerator due to lost data after user defined transformations in "hoodie.keygen.timebased.output.dateformat". Looking for proper fixing. was (Author: JIRAUSER301110): The issue related to HUDI-3204. Spark by default retrieve values for partitioning column from partition paths. We couldn't do it for TimestampBasedKeyGenerator due to lost data after user defined transformations in hoodie.keygen.timebased.output.dateformat. > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848040#comment-17848040 ] Geser Dugarov commented on HUDI-7709: - The issue is related to HUDI-3204. Spark by default retrieves values for the partitioning column from partition paths. We couldn't do that for TimestampBasedKeyGenerator because data is lost after the user-defined transformation in hoodie.keygen.timebased.output.dateformat. > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7709: Status: In Progress (was: Open) > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7709: --- Assignee: Geser Dugarov > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 0.15.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7709: Fix Version/s: 1.0.0 (was: 0.15.0) > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717 ] Geser Dugarov deleted comment on HUDI-7717: - was (Author: JIRAUSER301110): Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49 > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847023#comment-17847023 ] Geser Dugarov commented on HUDI-7717: - Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49 > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-7717. --- Resolution: Fixed Fixed in master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49 > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov resolved HUDI-7717. - > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846808#comment-17846808 ] Geser Dugarov commented on HUDI-7717: - The MR with the fix is under review. > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
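The semantics that hoodie.combine.before.insert is supposed to enforce, and that the ticket says were silently skipped for bulk_insert with meta fields disabled, can be sketched in plain Python: records sharing a record key are collapsed to the one with the highest precombine value before insert. This is an illustration of the expected behavior only, not Hudi's implementation; the field names are hypothetical:

```python
# Records as (record_key, precombine_value, payload) tuples.
records = [
    ("id1", 1, "a"),
    ("id1", 3, "c"),   # latest version of id1
    ("id2", 2, "b"),
    ("id1", 2, "b2"),
]

def combine_before_insert(records):
    """Keep, per record key, the record with the highest precombine value."""
    best = {}
    for key, pre, payload in records:
        if key not in best or pre > best[key][0]:
            best[key] = (pre, payload)
    return {k: v[1] for k, v in best.items()}

deduped = combine_before_insert(records)
print(deduped)  # {'id1': 'c', 'id2': 'b'}
```

When this pre-combine step is skipped, all four input records land in the table, which is the duplicate behavior reported in the linked GitHub issue.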
[jira] [Updated] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7717: Fix Version/s: 1.0.0 (was: 0.15.0) > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer
[ https://issues.apache.org/jira/browse/HUDI-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7757: Description: There is a return statement in the middle of the huge function HoodieSparkSqlWrite.writeInternal(). > Revisit shortcut for bulk insert with enabled row writer > > > Key: HUDI-7757 > URL: https://issues.apache.org/jira/browse/HUDI-7757 > Project: Apache Hudi > Issue Type: Task >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > > There is a return statement in the middle of the huge function > HoodieSparkSqlWrite.writeInternal(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer
Geser Dugarov created HUDI-7757: --- Summary: Revisit shortcut for bulk insert with enabled row writer Key: HUDI-7757 URL: https://issues.apache.org/jira/browse/HUDI-7757 Project: Apache Hudi Issue Type: Task Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7757) Revisit shortcut for bulk insert with enabled row writer
[ https://issues.apache.org/jira/browse/HUDI-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7757: --- Assignee: Geser Dugarov > Revisit shortcut for bulk insert with enabled row writer > > > Key: HUDI-7757 > URL: https://issues.apache.org/jira/browse/HUDI-7757 > Project: Apache Hudi > Issue Type: Task >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7646: Fix Version/s: 1.0.0 > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Fix For: 1.0.0 > > > The set of configuration parameters for Compaction service is confusing. > In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845082#comment-17845082 ] Geser Dugarov commented on HUDI-7646: - Prepared a local environment for running the TPC-H benchmark. I will investigate compaction parameter configuration from the user's point of view. > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. > In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov closed HUDI-7737. --- Fix Version/s: 1.0.0 Resolution: Fixed > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845076#comment-17845076 ] Geser Dugarov commented on HUDI-7737: - Fixed in the master branch: cdd146b2c73d50a28bee9f712b689df4fc923222 > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Labels: pull-request-available > > Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov resolved HUDI-7737. - > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Labels: pull-request-available > > Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7737: Priority: Minor (was: Major) > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > Labels: pull-request-available > > Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844891#comment-17844891 ] Geser Dugarov edited comment on HUDI-7717 at 5/9/24 7:24 AM: - Working on deploying and configuring a local PySpark environment for quick checks. I suppose that changing the Spark SaveMode from Overwrite to Append could lead to the expected behavior. was (Author: JIRAUSER301110): Working on local PySpark environment setting for quick checking. I suppose that change of Spark SaveMode from Overwrite to Append could lead to expected behavior. > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 0.15.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7737: Description: Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > > Spark 3.4.3 has been released: https://github.com/apache/spark/tree/v3.4.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
Geser Dugarov created HUDI-7737: --- Summary: Bump Spark 3.4 version to Spark 3.4.3 Key: HUDI-7737 URL: https://issues.apache.org/jira/browse/HUDI-7737 Project: Apache Hudi Issue Type: Improvement Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7737) Bump Spark 3.4 version to Spark 3.4.3
[ https://issues.apache.org/jira/browse/HUDI-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7737: --- Assignee: Geser Dugarov > Bump Spark 3.4 version to Spark 3.4.3 > - > > Key: HUDI-7737 > URL: https://issues.apache.org/jira/browse/HUDI-7737 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7717) hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
[ https://issues.apache.org/jira/browse/HUDI-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7717: --- Assignee: Geser Dugarov > hoodie.combine.before.insert silently broken for bulk_insert if meta fields > disabled (causes duplicates) > > > Key: HUDI-7717 > URL: https://issues.apache.org/jira/browse/HUDI-7717 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Fix For: 0.15.0 > > > Github issue - [https://github.com/apache/hudi/issues/11044] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov reassigned HUDI-7646: --- Assignee: Geser Dugarov > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. > In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7646: Status: In Progress (was: Open) > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. > In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839541#comment-17839541 ] Geser Dugarov commented on HUDI-7646: - The main question is whether to use ".inline" or ".async". The current distribution is the following. Using ".inline": * hoodie.compact.inline * hoodie.compact.schedule.inline * hoodie.log.compaction.inline * hoodie.clustering.inline * hoodie.clustering.schedule.inline * hoodie.partition.ttl.inline Using ".async": * hoodie.clean.async.enabled * clean.async.enabled * compaction.async.enabled * hoodie.kafka.compaction.async.enable * hoodie.clustering.async.enabled * clustering.async.enabled * hoodie.archive.async * hoodie.embed.timeline.server.async * hoodie.metadata.index.async * hoodie.datasource.compaction.async.enable It looks preferable to move toward the ".async" option. From the user's point of view, ".async" is also more self-explanatory than ".inline", which requires understanding the Hudi write process. > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. 
> In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
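The naming inconsistency the comment describes can be seen by putting the Spark-side and Flink-side options for the same intent side by side. A minimal sketch using option keys taken from the lists above (the values are hypothetical; this only contrasts the key names, not a working job config):

```python
# Spark-side options use ".inline" naming: compaction runs inside the writer.
spark_writer_options = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

# Flink-side options in FlinkOptions use ".async" naming for the same intent:
# "compact after N delta commits".
flink_writer_options = {
    "compaction.async.enabled": "true",
    "compaction.delta_commits": "5",
}

# Two engines, one feature, two naming schemes - the confusion the ticket
# proposes to resolve while preserving backward compatibility.
print(sorted(spark_writer_options), sorted(flink_writer_options))
```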
[jira] [Comment Edited] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839541#comment-17839541 ] Geser Dugarov edited comment on HUDI-7646 at 4/22/24 8:11 AM: -- The main question is which naming is preferable: ".inline" or ".async". The current distribution is the following. Using ".inline": * hoodie.compact.inline * hoodie.compact.schedule.inline * hoodie.log.compaction.inline * hoodie.clustering.inline * hoodie.clustering.schedule.inline * hoodie.partition.ttl.inline Using ".async": * hoodie.clean.async.enabled * clean.async.enabled * compaction.async.enabled * hoodie.kafka.compaction.async.enable * hoodie.clustering.async.enabled * clustering.async.enabled * hoodie.archive.async * hoodie.embed.timeline.server.async * hoodie.metadata.index.async * hoodie.datasource.compaction.async.enable It looks preferable to move toward the ".async" option. From the user's point of view, ".async" is also more self-explanatory than ".inline", which requires understanding the Hudi write process. was (Author: JIRAUSER301110): The main question is using ".inline" vs ".async". The current distribution is the following. Using ".inline": * hoodie.compact.inline * hoodie.compact.schedule.inline * hoodie.log.compaction.inline * hoodie.clustering.inline * hoodie.clustering.schedule.inline * hoodie.partition.ttl.inline Using ".async": * hoodie.clean.async.enabled * clean.async.enabled * compaction.async.enabled * hoodie.kafka.compaction.async.enable * hoodie.clustering.async.enabled * clustering.async.enabled * hoodie.archive.async * hoodie.embed.timeline.server.async * hoodie.metadata.index.async * hoodie.datasource.compaction.async.enable Looks like it's preferable to move toward ".async" option. And from user point of view, it's more obvious what ".async" means in comparing with ".inline", which needs to clarify the Hudi write process for a user. 
> Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. > In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7646) Consistent naming in Compaction service
[ https://issues.apache.org/jira/browse/HUDI-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7646: Description: The set of configuration parameters for Compaction service is confusing. In HoodieCompationConfig: * hoodie.compact.inline * hoodie.compact.schedule.inline * hoodie.log.compaction.enable * hoodie.log.compaction.inline * hoodie.compact.inline.max.delta.commits * hoodie.compact.inline.max.delta.seconds * hoodie.compact.inline.trigger.strategy * hoodie.parquet.small.file.limit * hoodie.record.size.estimation.threshold * hoodie.compaction.target.io * hoodie.compaction.logfile.size.threshold * hoodie.compaction.logfile.num.threshold * hoodie.compaction.strategy * hoodie.compaction.daybased.target.partitions * hoodie.copyonwrite.insert.split.size * hoodie.copyonwrite.insert.auto.split * hoodie.copyonwrite.record.size.estimate * hoodie.log.compaction.blocks.threshold In FlinkOptions: * compaction.async.enabled * compaction.schedule.enabled * compaction.delta_commits * compaction.delta_seconds * compaction.trigger.strategy * compaction.target_io * compaction.max_memory * compaction.tasks * compaction.timeout.seconds Need to refactor naming with saving backward compatibility. Priority: Minor (was: Major) > Consistent naming in Compaction service > --- > > Key: HUDI-7646 > URL: https://issues.apache.org/jira/browse/HUDI-7646 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Geser Dugarov >Priority: Minor > > The set of configuration parameters for Compaction service is confusing. 
> In HoodieCompationConfig: > * hoodie.compact.inline > * hoodie.compact.schedule.inline > * hoodie.log.compaction.enable > * hoodie.log.compaction.inline > * hoodie.compact.inline.max.delta.commits > * hoodie.compact.inline.max.delta.seconds > * hoodie.compact.inline.trigger.strategy > * hoodie.parquet.small.file.limit > * hoodie.record.size.estimation.threshold > * hoodie.compaction.target.io > * hoodie.compaction.logfile.size.threshold > * hoodie.compaction.logfile.num.threshold > * hoodie.compaction.strategy > * hoodie.compaction.daybased.target.partitions > * hoodie.copyonwrite.insert.split.size > * hoodie.copyonwrite.insert.auto.split > * hoodie.copyonwrite.record.size.estimate > * hoodie.log.compaction.blocks.threshold > In FlinkOptions: > * compaction.async.enabled > * compaction.schedule.enabled > * compaction.delta_commits > * compaction.delta_seconds > * compaction.trigger.strategy > * compaction.target_io > * compaction.max_memory > * compaction.tasks > * compaction.timeout.seconds > Need to refactor naming with saving backward compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7646) Consistent naming in Compaction service
Geser Dugarov created HUDI-7646: --- Summary: Consistent naming in Compaction service Key: HUDI-7646 URL: https://issues.apache.org/jira/browse/HUDI-7646 Project: Apache Hudi Issue Type: Improvement Reporter: Geser Dugarov -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825954#comment-17825954 ] Geser Dugarov edited comment on HUDI-6438 at 3/13/24 8:03 AM:
--
The first fix, commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master branch, has been reverted by commit bc522a6ce4142510f43529798ef4217839d71624. The reason is that this issue is similar to HUDI-6219, which has been fixed properly without adding new parameters; the corresponding commit in the master branch is ea547e5681a007e546b8ca8cb1399da0a4cd5012.

was (Author: JIRAUSER301110): First fix by commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master branch has been reverted, commit bc522a6ce4142510f43529798ef4217839d71624 in the master branch. The reason, that this issue has similar issue [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], which has been fixed properly without adding new parameters.

> Fix issue while inserting non-nullable array columns to nullable columns
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.1.0
>
> Github issue - [https://github.com/apache/hudi/issues/9042]
[jira] [Commented] (HUDI-6438) Fix issue while inserting non-nullable array columns to nullable columns
[ https://issues.apache.org/jira/browse/HUDI-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825954#comment-17825954 ] Geser Dugarov commented on HUDI-6438:
-
The first fix, commit 42799c0956f626bc47318ddd91c626b1e58a0fc8 in the master branch, has been reverted by commit bc522a6ce4142510f43529798ef4217839d71624. The reason is that this issue is similar to [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], which has been fixed properly without adding new parameters.

> Fix issue while inserting non-nullable array columns to nullable columns
>
> Key: HUDI-6438
> URL: https://issues.apache.org/jira/browse/HUDI-6438
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.1.0
>
> Github issue - [https://github.com/apache/hudi/issues/9042]
[jira] [Commented] (HUDI-6219) Ensure consistency between Spark catalog schema and Hudi schema
[ https://issues.apache.org/jira/browse/HUDI-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825953#comment-17825953 ] Geser Dugarov commented on HUDI-6219:
-
Fixed in the master branch by commit ea547e5681a007e546b8ca8cb1399da0a4cd5012.

> Ensure consistency between Spark catalog schema and Hudi schema
> ---
>
> Key: HUDI-6219
> URL: https://issues.apache.org/jira/browse/HUDI-6219
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wechar
> Priority: Major
> Labels: pull-request-available
>
> [HUDI-4149|https://github.com/apache/hudi/pull/5672] fixed the drop-table error when the table directory had been moved, but it makes the Spark catalog table schema inconsistent with the Hudi schema if some column types are not Avro data types.
>
> *Root cause:*
> The Hudi schema uses Avro types, but the Spark catalog table schema does not. There are two steps to record the schema when creating a Hudi table:
> Step 1: record the Avro-compatible schema in .hoodie/hoodie.properties.
> Step 2: record the table in the Spark catalog.
> Step 2 uses HoodieCatalog.tableSchema, which is currently table.schema, causing this issue.
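The two-step divergence described in the root cause can be sketched in plain Python. This is a hypothetical, simplified illustration (not Hudi code): the type names and the mapping table below are stand-ins, exploiting only the well-known fact that Avro has no short/byte types, so such columns widen to int in an Avro-compatible schema.

```python
# Hypothetical sketch of the divergence: step 1 persists an Avro-compatible
# schema (hoodie.properties), step 2 records the original table schema in
# the Spark catalog. Types without a direct Avro equivalent end up differing.

# Avro has no short/byte types; Spark's ShortType/ByteType widen to Avro int.
AVRO_TYPE_MAP = {"short": "int", "byte": "int"}

def to_avro_compatible(schema):
    """Map each field type to its closest Avro type (identity if already valid)."""
    return {name: AVRO_TYPE_MAP.get(t, t) for name, t in schema.items()}

catalog_schema = {"id": "long", "qty": "short"}   # what the Spark catalog records
hudi_schema = to_avro_compatible(catalog_schema)  # what hoodie.properties records

print(hudi_schema)                     # {'id': 'long', 'qty': 'int'}
print(hudi_schema == catalog_schema)   # False: the two schemas have diverged
```

The fix referenced above (commit ea547e5681a007e546b8ca8cb1399da0a4cd5012) addresses this by keeping the catalog-recorded schema consistent with the Avro-compatible one.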
[jira] [Comment Edited] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825613#comment-17825613 ] Geser Dugarov edited comment on HUDI-7493 at 3/12/24 12:17 PM:
---
Could be labeled with the "Config Simplification" epic.

was (Author: JIRAUSER301110): Could be label by the ["Config Simplification" epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Major
> Labels: pull-request-available
>
> Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}.
[jira] [Commented] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825613#comment-17825613 ] Geser Dugarov commented on HUDI-7493:
-
Could be labeled with the ["Config Simplification" epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Major
>
> Sometimes we use {{hoodie.clean.*}} and sometimes {{hoodie.cleaner.*}}.
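The mixed `hoodie.clean.*` / `hoodie.cleaner.*` prefixes can be shown with a few cleaning-related keys. The specific key names below are drawn from recent Hudi releases for illustration only (an assumption about the current config set, not an exhaustive list), and the grouping is a stand-alone sketch:

```python
# A few cleaning-related keys (as found in recent Hudi releases; listed for
# illustration) that mix the "hoodie.clean." and "hoodie.cleaner." prefixes.
keys = [
    "hoodie.clean.automatic",
    "hoodie.clean.async",
    "hoodie.cleaner.policy",
    "hoodie.cleaner.commits.retained",
    "hoodie.cleaner.parallelism",
]

# Group keys by their second path segment to make the split visible.
by_prefix = {}
for k in keys:
    by_prefix.setdefault(k.split(".")[1], []).append(k)

for prefix, group in sorted(by_prefix.items()):
    print(f"hoodie.{prefix}.*: {group}")
```

Two prefixes for one service is exactly the inconsistency the issue title refers to; a rename with backward-compatible aliases would collapse them into one.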