[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default

2024-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7938:
-
Labels: pull-request-available  (was: )

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
> ---
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
>
> HUDI-7567 added schema evolution to the filegroup reader (#10957),
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> we got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
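> A possible workaround until the fix lands, sketched here under the assumption 
> that registering the Hudi Kryo registrar up front avoids the unwrapped 
> configuration (the two Spark config keys below are standard Spark keys; the 
> registrar class is the one shipped with Hudi):
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> // Register HoodieSparkKryoRegistrar explicitly when building the session;
> // the same keys can also be passed via --conf in spark-submit or PySpark.
> SparkSession spark = SparkSession.builder()
>     .appName("hudi-kryo-workaround")
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
>     .getOrCreate();
> {code}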



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7980:
-
Labels: pull-request-available  (was: )

> Optimize the configuration content when performing clustering with row writer
> -
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the row writer defaults to snapshot reads for all tables. However, 
> this method is relatively inefficient for MOR (Merge on Read) tables when 
> there are no logs. Therefore, we should optimize this part of the 
> configuration.
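> A minimal sketch of the proposed decision, assuming the optimization is to 
> fall back to a plain base-file read when a MOR file slice carries no log 
> files (the query-type strings are illustrative, not the exact config values):
> {code:java}
> // If the slice has log files, a merge (snapshot read) is needed;
> // otherwise reading just the base file is sufficient and cheaper.
> boolean hasLogFiles = fileSlice.getLogFiles().findAny().isPresent();
> String queryType = hasLogFiles ? "snapshot" : "read_optimized";
> {code}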



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7976:
-
Labels: pull-request-available  (was: )

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
> ---
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
>
> In the bugfix for HUDI-7955, the wrong class for invoking {{getTimestamp}} 
> was used.
>  # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp
>  # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>  
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>  
> Submitting a bugfix to fix this bugfix... 
> The log level for the exception block is also changed to warn so that errors 
> will be printed out.
> On top of that, we have simplified the {{getMillis}} shim, removing the 
> method that was added in HUDI-7955, to standardise it with how {{getDays}} is 
> written.
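> For reference, a hedged sketch of the corrected call path (not the exact 
> patch; {{value}} stands for the writable handed to the serializer):
> {code:java}
> import org.apache.hadoop.hive.common.type.Timestamp;
> import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;
> 
> // On Hive3, the writable wrapper is TimestampWritableV2, and its
> // getTimestamp() returns the Hive3 Timestamp type, from which the
> // epoch millis can be derived for the getMillis shim.
> TimestampWritableV2 writable = (TimestampWritableV2) value;
> Timestamp ts = writable.getTimestamp();
> long millis = ts.toEpochMilli();
> {code}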
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7979) Fix out of the box defaults with spillable memory configs

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7979:
-
Labels: pull-request-available  (was: )

> Fix out of the box defaults with spillable memory configs 
> --
>
> Key: HUDI-7979
> URL: https://issues.apache.org/jira/browse/HUDI-7979
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Looks like we are very conservative wrt memory configs used for spillable map 
> based FSV. 
>  
> For example, we are only allocating 15MB out of the box to file groups when 
> using the spillable map based FSV.
>  public long getMaxMemoryForFileGroupMap() {
>    long totalMemory = getLong(SPILLABLE_MEMORY);
>    return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
>  }
>  
> SPILLABLE_MEMORY default is 100MB.
> getMaxMemoryForPendingCompaction = 80% of 100MB = 80MB.
> getMaxMemoryForBootstrapBaseFile = 5% of 100MB = 5MB.
> So, overall, out of the box we are allocating only 15MB for 
> getMaxMemoryForFileGroupMap.
> ref: 
> [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
> Wondering whether we even need 80% for the pending compaction tracker in our 
> FSV. I am thinking of making it 15%, so that we can give more memory to actual 
> file groups. We may not have a lot of pending compactions for a given table. 
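> For clarity, the default allocation math spelled out (constants mirror the 
> defaults quoted above):
> {code:java}
> long spillableMemory   = 100L * 1024 * 1024;               // SPILLABLE_MEMORY default, 100MB
> long pendingCompaction = (long) (0.80 * spillableMemory);  // 80% reserved -> 80MB
> long bootstrapBase     = (long) (0.05 * spillableMemory);  // 5% reserved  -> 5MB
> long fileGroupMap      = spillableMemory - pendingCompaction - bootstrapBase;
> // fileGroupMap == 15MB; with 15% for pending compaction instead, it becomes 80MB
> {code}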



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7978:
-
Labels: pull-request-available  (was: )

> Update docs for older versions to state that partitions should be ordered 
> when creating multiple partitions
> ---
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7977) improve bucket index partitioner

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7977:
-
Labels: pull-request-available  (was: )

> improve bucket index partitioner
> ---
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm to make the data 
> evenly distributed.
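> An illustrative sketch of the idea (not the actual BucketIndexUtil code): 
> offset each partition's buckets by a hash of the partition path, so that 
> bucket 0 of every partition does not land on the same task:
> {code:java}
> static int taskForBucket(String partitionPath, int bucketId, int parallelism) {
>   // Mask keeps the hash non-negative (avoids the Math.abs(MIN_VALUE) pitfall).
>   int offset = (partitionPath.hashCode() & Integer.MAX_VALUE) % parallelism;
>   return (offset + bucketId) % parallelism;
> }
> {code}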



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7975) Transfer extra metadata to new commits when new data is not ingested to trigger table services on the dataset

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7975:
-
Labels: pull-request-available  (was: )

> Transfer extra metadata to new commits when new data is not ingested to 
> trigger table services on the dataset
> ---
>
> Key: HUDI-7975
> URL: https://issues.apache.org/jira/browse/HUDI-7975
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7974) Create empty clean commit at a cadence and make it configurable

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7974:
-
Labels: pull-request-available  (was: )

> Create empty clean commit at a cadence and make it configurable
> ---
>
> Key: HUDI-7974
> URL: https://issues.apache.org/jira/browse/HUDI-7974
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is also stored in table config

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7970:
-
Labels: pull-request-available  (was: )

> Add support to read partition fields when partition type is also stored in 
> table config
> ---
>
> Key: HUDI-7970
> URL: https://issues.apache.org/jira/browse/HUDI-7970
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` 
> to also store partition type. This PR aims to make sure that the getter and 
> other functions accessing this field remain consistent in behaviour with the 
> new value type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7969) Fix data loss caused by concurrent write and clean

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7969:
-
Labels: pull-request-available  (was: )

> Fix data loss caused by concurrent write and clean
> --
>
> Key: HUDI-7969
> URL: https://issues.apache.org/jira/browse/HUDI-7969
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Xinyu Zou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7692) Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7692:
-
Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType
> --
>
> Key: HUDI-7692
> URL: https://issues.apache.org/jira/browse/HUDI-7692
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> https://github.com/apache/hudi/pull/10352#discussion_r1584137942



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7025) Merge Index and Functional Index Config

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7025:
-
Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Merge Index and Functional Index Config
> ---
>
> Key: HUDI-7025
> URL: https://issues.apache.org/jira/browse/HUDI-7025
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Minor
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> There is an {{INDEX}} sub-group name in `ConfigGroups`. Functional index 
> configs can be consolidated within it.
>  
> https://github.com/apache/hudi/pull/9872#discussion_r1377115549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries

2024-07-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7967:
-
Labels: RobustWrites pull-request-available  (was: RobustWrites)

> Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7968:
-
Labels: RobustWrites pull-request-available  (was: RobustWrites)

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7962) Add show create table command

2024-07-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7962:
-
Labels: pull-request-available  (was: )

> Add show create table command
> -
>
> Key: HUDI-7962
> URL: https://issues.apache.org/jira/browse/HUDI-7962
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7966) NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference

2024-07-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7966:
-
Labels: pull-request-available  (was: )

> NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference
> ---
>
> Key: HUDI-7966
> URL: https://issues.apache.org/jira/browse/HUDI-7966
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Running the 
> [long-running|https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/spark-long-running-non-partitioned.yaml]
>  deltastreamer with the following properties: 
> [https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/test-nonpartitioned.properties]
> The job throws an NPE during the validation phase:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 69.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 69.0 (TID 345) (10.0.103.207 executor 1): 
> java.lang.NullPointerException  at 
> org.apache.avro.JsonProperties$2$1$1.<init>(JsonProperties.java:175)  at 
> org.apache.avro.JsonProperties$2$1.iterator(JsonProperties.java:174)  at 
> org.apache.avro.JsonProperties.getObjectProps(JsonProperties.java:305)  at 
> org.apache.hudi.avro.AvroSchemaUtils.createNewSchemaFromFieldsWithReference(AvroSchemaUtils.java:306)
>   at 
> org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaBase(AvroSchemaUtils.java:293)
>   at 
> org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested(AvroSchemaUtils.java:245)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.generateRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:146)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.prepareRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:150)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.<init>(HoodieFileGroupReaderSchemaHandler.java:84)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReader.<init>(HoodieFileGroupReader.java:113)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:170)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)  
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
>  at org.apache.spark.scheduler.Task.run(Task.scala:136)  at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}
> It seems like the code assumes that all schemas must have properties, which 
> may not necessarily be true.
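> A minimal defensive sketch of the guard (illustrative only - the trace shows 
> the failure surfacing inside getObjectProps itself, so the merged fix may 
> need to avoid that call rather than just null-check its result; `schema` and 
> `newSchema` are placeholder names):
> {code:java}
> import java.util.Map;
> 
> // Copy properties from the reference schema onto the new schema only when
> // they are actually present, instead of assuming every schema has props.
> Map<String, Object> props = schema.getObjectProps();
> if (props != null && !props.isEmpty()) {
>   props.forEach(newSchema::addProp);
> }
> {code}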



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7965) Clean up SchemaTestUtil code

2024-07-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7965:
-
Labels: pull-request-available  (was: )

> Clean up SchemaTestUtil code
> 
>
> Key: HUDI-7965
> URL: https://issues.apache.org/jira/browse/HUDI-7965
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7963) Avoid generating RLI records when disabled w/ MDT

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7963:
-
Labels: pull-request-available  (was: )

> Avoid generating RLI records when disabled w/ MDT
> -
>
> Key: HUDI-7963
> URL: https://issues.apache.org/jira/browse/HUDI-7963
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7961:
-
Labels: pull-request-available  (was: )

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7958) Create partition stats index for all columns when no columns specified

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7958:
-
Labels: pull-request-available  (was: )

> Create partition stats index for all columns when no columns specified
> --
>
> Key: HUDI-7958
> URL: https://issues.apache.org/jira/browse/HUDI-7958
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Just like the column stats index, we can create the partition stats index for 
> all columns if no columns are configured by the user.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7957) data skew when writing with bulk_insert + bucket_index enabled

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7957:
-
Labels: pull-request-available  (was: )

> data skew when writing with bulk_insert + bucket_index enabled
> --
>
> Key: HUDI-7957
> URL: https://issues.apache.org/jira/browse/HUDI-7957
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11565] says, when using row-writer 
> bulk insert on a bucketed table, data will skew because of the partitioner 
> algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7955) Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and Hive2 discrepancies

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7955:
-
Labels: pull-request-available  (was: )

> Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and 
> Hive2 discrepancies
> -
>
> Key: HUDI-7955
> URL: https://issues.apache.org/jira/browse/HUDI-7955
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-07-05-18-11-33-420.png, 
> image-2024-07-05-18-13-28-135.png
>
>
> The invocation of *getPrimitiveJavaObject* returns a different implementation 
> of timestamp in Hive3 and Hive2. 
>  - Hive2: *java.sql.Timestamp*
>  - Hive3: *org.apache.hadoop.hive.common.type.Timestamp*
> Hudi common is compiled with Hive2, but Trino is using Hive3, causing a 
> discrepancy between compile time and runtime. When the execution flow falls 
> into this section of the code, with the trigger conditions listed below:
> 1. A MOR table is used
> 2. The user is querying the _rt table
> 3. The user's table has a *TIMESTAMP* type and the query requires it
> 4. A merge is required as the record is present in both the Parquet and log file
> the error below will be thrown:
> {code:java}
> Query 20240704_075218_05052_yfmfc failed: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
> java.lang.NoSuchMethodError: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:304)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:212)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:121)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:108)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:185)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:172)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:114)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:49)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:88)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>         at 
> io.trino.plugin.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:215)
>         at 
> io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:88)
>         at 
> io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120){code}
> h1. Hive3
> !image-2024-07-05-18-11-33-420.png|width=509,height=572!
> h1. Hive2
> !image-2024-07-05-18-13-28-135.png|width=507,height=501!
>  
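> A hedged shim sketch (not the actual Hudi fix): resolving the method 
> reflectively sidesteps the compiled return-type signature, which is what 
> differs between the two Hive lines:
> {code:java}
> import java.lang.reflect.Method;
> import java.sql.Timestamp;
> 
> // Works against both Hive2 and Hive3 object inspectors: invoke
> // getPrimitiveJavaObject reflectively, then normalize the result.
> static Timestamp toSqlTimestamp(Object inspector, Object writable) throws Exception {
>   Method m = inspector.getClass().getMethod("getPrimitiveJavaObject", Object.class);
>   Object ts = m.invoke(inspector, writable); // java.sql.Timestamp (Hive2) or
>                                              // hive.common.type.Timestamp (Hive3)
>   if (ts instanceof Timestamp) {
>     return (Timestamp) ts;
>   }
>   // Both types print as "yyyy-mm-dd hh:mm:ss[.f...]", which valueOf parses.
>   return Timestamp.valueOf(ts.toString());
> }
> {code}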
> h1. How to reproduce
>  
>  
> {code:java}
> CREATE TABLE dev_hudi.hudi_7955__hive3_timestamp_issue (
>     id INT,
>     name STRING,
>     timestamp_col TIMESTAMP,
>     grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
>     primaryKey = 'id',
>     type = 'mor',
>     precombineField = 'id',
>     hoodie.index.type = 'BUCKET',
>     hoodie.index.bucket.engine = 'CONSISTENT_HASHING',
>     hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://path/to/hudi_tables/hudi_7955__hive3_timestamp_issue';
> -- 5 separate commits to trigger compaction
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (1, 'alex1', 
> now(), 'SG');
> -- No error here as no MERGE is required between the Parquet and log files
> SELECT _hoodie_file_name, id, timestamp_col FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (2, 'alex2', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (3, 'alex3', 
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3

[jira] [Updated] (HUDI-7954) Fix data skipping with secondary index when there are no log files

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7954:
-
Labels: pull-request-available  (was: )

> Fix data skipping with secondary index when there are no log files
> --
>
> Key: HUDI-7954
> URL: https://issues.apache.org/jira/browse/HUDI-7954
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> When there are no log files in the index, the lookup returns no secondary 
> keys or candidate files because of a bug - `logRecordsMap` is empty in this 
> code and base file records are ignored - 
> [https://github.com/apache/hudi/blob/70f44efe298771fcef9d029820a9b431e1ff165c/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java#L970]
> The current tests for pruning assert that the filtered files count < total 
> data files count. That is weak in the sense that it does not assert filtered 
> files count > 0, and hence the assertion passed even when the filtered files 
> count was 0. Ultimately, all files were getting scanned. We should fix this 
> behavior.
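> The strengthened assertion, sketched (variable names are illustrative):
> {code:java}
> import static org.junit.jupiter.api.Assertions.assertTrue;
> 
> // Pruning must drop some files, but must never drop all of them.
> assertTrue(filteredFileCount < totalDataFileCount, "pruning should skip some files");
> assertTrue(filteredFileCount > 0, "pruning must not filter out every file");
> {code}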



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7953) Improved the variable naming and formatting of HoodieActiveTimeline and HoodieIndex

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7953:
-
Labels: pull-request-available  (was: )

> Improved the variable naming and formatting of HoodieActiveTimeline and 
> HoodieIndex
> ---
>
> Key: HUDI-7953
> URL: https://issues.apache.org/jira/browse/HUDI-7953
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6510) Java 17 compile time support

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6510:
-
Labels: pull-request-available  (was: )

> Java 17 compile time support
> 
>
> Key: HUDI-6510
> URL: https://issues.apache.org/jira/browse/HUDI-6510
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Certify Hudi with Java 17 compile time support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7929:
-
Labels: pull-request-available  (was: )

> Add Flink Hudi Example for K8s
> --
>
> Key: HUDI-7929
> URL: https://issues.apache.org/jira/browse/HUDI-7929
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7949) insert into hudi table with columns specified (reordered and not in table schema order) throws exception

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7949:
-
Labels: pull-request-available  (was: )

> insert into hudi table with columns specified (reordered and not in table 
> schema order) throws exception
> ---
>
> Key: HUDI-7949
> URL: https://issues.apache.org/jira/browse/HUDI-7949
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11552



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7937:
-
Labels: pull-request-available  (was: )

> Fix handling of decimals in StreamSync and Clustering
> -
>
> Key: HUDI-7937
> URL: https://issues.apache.org/jira/browse/HUDI-7937
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> When decimals use a small precision, we need to write them in the legacy 
> Parquet format to ensure all Hudi components can read them back. 
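> For context: non-legacy Parquet writers encode small-precision decimals as 
> INT32/INT64, while the legacy encoding uses FIXED_LEN_BYTE_ARRAY, which older 
> readers expect. A hedged sketch of forcing the legacy encoding on the write 
> path (the config key is an assumption - verify it against your Hudi version):
> {code:java}
> df.write().format("hudi")
>     .option("hoodie.table.name", "my_table")                     // illustrative
>     .option("hoodie.parquet.writelegacyformat.enabled", "true")  // assumed key
>     .mode("append")
>     .save(basePath);
> {code}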



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7951:
-
Labels: pull-request-available  (was: )

> Classes using avro causing conflict in hudi-aws-bundle
> --
>
> Key: HUDI-7951
> URL: https://issues.apache.org/jira/browse/HUDI-7951
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Hudi 0.15 added some Hudi classes with Avro usages 
> (ParquetTableSchemaResolver in this case) and also made hudi-aws-bundle 
> depend on hudi-hadoop-common. hudi-aws-bundle does not relocate Avro classes 
> to be compatible with hudi-spark.
>  
> The issue happens when using hudi-flink-bundle together with hudi-aws-bundle: 
> hudi-flink-bundle has relocated Avro classes, which causes a class conflict:
> {code:java}
> java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
> org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
>  org.apache.hadoop.conf.Configuration)'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7950:
-
Labels: pull-request-available  (was: )

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0, 0.15.1
>
>
> We should unify the shading rule of roaring bitmap dependency in the root POM 
> for consistency among bundles.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7941) add show_file_status procedure

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7941:
-
Labels: pull-request-available  (was: )

> add show_file_status procedure
> --
>
> Key: HUDI-7941
> URL: https://issues.apache.org/jira/browse/HUDI-7941
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> When incrementally consuming a Hudi table or performing clustering or 
> compaction operations on it, it is often found that a certain file does not 
> exist, and finding out which operation deleted the file is very troublesome. 
> For this purpose, we provide a tool `show_file_status` to view whether a 
> specified file has been deleted and which actions deleted it.
> usage:
> call show_file_status(table => '$tableName', partition => '$partition', file 
> => '$fileName')
> call show_file_status(table => '$tableName', file => '$fileName')
> output:
> 1) the file was deleted by the restore action
> +-------+-------+-----------------+--------+---------+
> |status |action |instant          |timeline|full_path|
> +-------+-------+-----------------+--------+---------+
> |deleted|restore|20240629225539880|active  |         |
> +-------+-------+-----------------+--------+---------+
> 2) the file has been deleted in other ways, such as hdfs dfs -rm
> +-------+------+-------+--------+---------+
> |status |action|instant|timeline|full_path|
> +-------+------+-------+--------+---------+
> |unknown|      |       |        |         |
> +-------+------+-------+--------+---------+
> 3) the file exists
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |status|action|instant|timeline|full_path                                                                                                                                       |
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |exist |      |       |active  |/Users/xx/xx/others/data/hudi-warehouse/source1/hudi_mor_append/sex=0/85ad0f44-22bf-4733-99bf-06382d6eacd5-0_0-130-89_20240629230123162.parquet|
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7948) RFC-80: Support column families for wide tables

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7948:
-
Labels: pull-request-available  (was: )

> RFC-80: Support column families for wide tables
> ---
>
> Key: HUDI-7948
> URL: https://issues.apache.org/jira/browse/HUDI-7948
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vova Kolmakov
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Write, discuss, and approve the RFC document in GitHub



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7943) Resolve version conflict of fasterxml on spark3.2

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7943:
-
Labels: pull-request-available  (was: )

> Resolve version conflict of fasterxml on spark3.2 
> --
>
> Key: HUDI-7943
> URL: https://issues.apache.org/jira/browse/HUDI-7943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
> Environment: hudi0.14.1, Spark3.2
>Reporter: Jihwan Lee
>Priority: Major
>  Labels: pull-request-available
>
> When running a streaming read on Spark 3.2, an exception is raised that 
> requires the correct version of jackson-databind.
> Spark versions other than 3.2 seem to use the versions matching their Spark 
> dependencies.
>  
> version refer: https://github.com/apache/spark/blob/v3.2.3/pom.xml#L170
>  
> example code:
>  
> {code:java}
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.common.table.HoodieTableConfig._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.common.model.HoodieRecord
> import spark.implicits._
> val basePath = "hdfs:///tmp/trips_table"
> spark.readStream
> .format("hudi")
> .option("hoodie.datasource.query.type", "incremental")
> .option("hoodie.datasource.query.incremental.format", "cdc")
> .load(basePath)
> .writeStream
> .format("console")
> .option("checkpointLocation", "/tmp/trips_table_checkpoint")
> .outputMode("append")
> .start().awaitTermination()
> {code}
>  
>  
> error log:
>  
> {code:java}
> Caused by: java.lang.ExceptionInInitializerError: 
> com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.0 
> requires Jackson Databind version >= 2.10.0 and < 2.11.0
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieSourceOffset.<init>(HoodieSourceOffset.scala:30)
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getLatestOffset(HoodieStreamSource.scala:127)
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getOffset(HoodieStreamSource.scala:138)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:403)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:402)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:384)
>   at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:627)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:380)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:210)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
>   at 
> org.apache.spark.sql.execution.streaming

[jira] [Updated] (HUDI-7883) Ensure 1.x commit instants are readable w/ 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7883:
-
Labels: pull-request-available  (was: )

> Ensure 1.x commit instants are readable w/ 0.16.0 
> --
>
> Key: HUDI-7883
> URL: https://issues.apache.org/jira/browse/HUDI-7883
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>
> Ensure 1.x commit instants are readable w/ 0.16.0 reader.
>  
> Maybe we need to migrate the HoodieInstant parsing logic to 0.16.0 in a 
> backwards-compatible manner, or it is already ported and we just need to 
> write tests and validate. 
> [https://github.com/apache/hudi/pull/9617] - contains some portion 
> (HoodieInstant changes and some method renames)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7945:
-
Labels: pull-request-available  (was: )

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859].
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(
>     HoodieColumnRangeMetadata<T> one,
>     HoodieColumnRangeMetadata<T> another) {
>   ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>       "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>       null, one.getColumnName(), minValue, maxValue,
>       one.getNullCount() + another.getNullCount(),
>       one.getValueCount() + another.getValueCount(),
>       one.getTotalSize() + another.getTotalSize(),
>       one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> }
> {code}
> The null causes an NPE when loading the column stats per partition from the 
> PARTITION_STATS index.  Also, the current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name, and it does not work even if the file path does not contain 
> null (even a stored list of file names does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
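> For context, the "null key" message comes from `Collectors.groupingBy`, which 
> rejects null classifier results; a standalone illustration (not Hudi code):
> {code:java}
> import java.util.Arrays;
> import java.util.List;
> import java.util.Map;
> import java.util.stream.Collectors;
> 
> List<String> filePaths = Arrays.asList("a.parquet", null);
> Map<String, List<String>> byFile = filePaths.stream()
>     .collect(Collectors.groupingBy(p -> p));
> // throws NullPointerException: element cannot be mapped to a null key
> {code}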
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.Ho

[jira] [Updated] (HUDI-7940) Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7940:
-
Labels: pull-request-available  (was: )

> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table
> ---
>
> Key: HUDI-7940
> URL: https://issues.apache.org/jira/browse/HUDI-7940
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>
> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7882:
-
Labels: pull-request-available  (was: )

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We wanted to support reading 1.x tables in the 0.16.0 release, so this 
> umbrella ticket tracks all of the required changes.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 5. Tests 
> 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 6 Doc changes 
> 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7905:
-
Labels: pull-request-available  (was: )

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action for the requested and 
> inflight instants. This simplifies a few things: for example, we do not need 
> to scan replacecommit.requested to determine whether we are looking at a 
> clustering plan or not. It would also simplify the usage of the pending 
> clustering APIs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7859) Rename instant files to be consistent with 0.x naming format

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7859:
-
Labels: pull-request-available  (was: )

> Rename instant files to be consistent with 0.x naming format
> 
>
> Key: HUDI-7859
> URL: https://issues.apache.org/jira/browse/HUDI-7859
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Needed for downgrade



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7915) Spark 4 support

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7915:
-
Labels: pull-request-available  (was: )

> Spark 4 support
> ---
>
> Key: HUDI-7915
> URL: https://issues.apache.org/jira/browse/HUDI-7915
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Spark 4.0.0-preview1 is out.  We should start integrating Hudi with Spark 4 
> and surface any issues early on.
> https://spark.apache.org/news/spark-4.0.0-preview1.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4822) Extract the baseFile and logFiles from HoodieDeltaWriteStat in the right way

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4822:
-
Labels: pull-request-available  (was: )

> Extract the baseFile and logFiles from HoodieDeltaWriteStat in the right way
> 
>
> Key: HUDI-4822
> URL: https://issues.apache.org/jira/browse/HUDI-4822
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Currently, we can't get the `baseFile` and `logFiles` members from 
> `HoodieDeltaWriteStat` directly. That's because the related information is 
> lost after deserialization from the commit files. So we need to improve this.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7903) Partition Stats Index not getting created with SQL

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7903:
-
Labels: pull-request-available  (was: )

> Partition Stats Index not getting created with SQL
> --
>
> Key: HUDI-7903
> URL: https://issues.apache.org/jira/browse/HUDI-7903
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> {code:java}
> spark.sql(
>   s"""
>  | create table $tableName using hudi
>  | partitioned by (dt)
>  | tblproperties(
>  |primaryKey = 'id',
>  |preCombineField = 'ts',
>  |'hoodie.metadata.index.partition.stats.enable' = 'true'
>  | )
>  | location '$tablePath'
>  | AS
>  | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
> cast('2021-05-06' as date) as dt
>""".stripMargin
> ) {code}
> Even when partition stats is enabled, the index is not created through SQL; it 
> works through the datasource writer.
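> For contrast, a minimal datasource write where the index does get built 
> (reusing the same config key and the tableName/tablePath variables from the 
> SQL above; the sample row is illustrative):
> {code:java}
> import spark.implicits._
>
> Seq((1, "a1", 10, 1000L, "2021-05-06")).toDF("id", "name", "price", "ts", "dt")
>   .write.format("hudi")
>   .option("hoodie.table.name", tableName)
>   .option("hoodie.datasource.write.recordkey.field", "id")
>   .option("hoodie.datasource.write.precombine.field", "ts")
>   .option("hoodie.datasource.write.partitionpath.field", "dt")
>   .option("hoodie.metadata.index.partition.stats.enable", "true")
>   .mode("overwrite")
>   .save(tablePath)
> {code}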



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7926) dataskipping failure mode should be strict in test

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7926:
-
Labels: pull-request-available  (was: )

> dataskipping failure mode should be strict in test
> --
>
> Key: HUDI-7926
> URL: https://issues.apache.org/jira/browse/HUDI-7926
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Critical
>  Labels: pull-request-available
>
> The data skipping failure mode should be strict in tests. If the default 
> fallback mode is used, the query unit tests are meaningless: a data-skipping 
> bug silently falls back to a full scan, so regressions can be introduced 
> without any test catching them.
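> A minimal sketch of what the tests would pin down (the exact config key is an 
> assumption here and should be checked against the DataSkippingFailureMode 
> config):
> {code:java}
> // In test setup: fail loudly instead of silently falling back to a full scan,
> // so any data-skipping bug surfaces as a test failure rather than a
> // correct-but-unpruned query result.
> spark.sql("set hoodie.fileindex.data.skipping.failure.mode=strict")
> {code}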



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7709:
-
Labels: pull-request-available  (was: )

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7924) Capture Latency and Failure Metrics For Hive Table recreation

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7924:
-
Labels: pull-request-available  (was: )

> Capture Latency and Failure Metrics For Hive Table recreation
> -
>
> Key: HUDI-7924
> URL: https://issues.apache.org/jira/browse/HUDI-7924
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> As part of recreating the Glue/Hive table whenever schema or partition sync 
> fails, we want to capture and push metrics (see the sketch below):
>  * a latency metric for the time taken to recreate and sync the table;
>  * a failure metric emitted when the recreate-and-sync fails.
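> A hedged sketch of the intended instrumentation (the helper and metric names 
> are stand-ins, not the final code):
> {code:java}
> // `recreateAndSync` and the metric sink `reportMs` are hypothetical hooks.
> def timedRecreateAndSync(recreateAndSync: () => Unit,
>                          reportMs: (String, Long) => Unit): Unit = {
>   val start = System.currentTimeMillis()
>   try {
>     recreateAndSync()
>     reportMs("meta_sync.recreate_and_sync.duration_ms", System.currentTimeMillis() - start)
>   } catch {
>     case e: Exception =>
>       reportMs("meta_sync.recreate_and_sync.failure", 1L) // failure counter
>       throw e
>   }
> }
> {code}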



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7922) Add Hudi CLI bundle for Scala 2.13

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7922:
-
Labels: pull-request-available  (was: )

> Add Hudi CLI bundle for Scala 2.13
> --
>
> Key: HUDI-7922
> URL: https://issues.apache.org/jira/browse/HUDI-7922
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The Hudi CLI bundle build should succeed with Scala 2.13 and work on Spark 3.5 
> with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7921:
-
Labels: pull-request-available  (was: )

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We see OOMs when Deltastreamer runs continuously for days. We suspect memory 
> leaks when the metadata table is enabled. Let's chase them all down and fix 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7911) Enable cdc log for MOR table

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7911:
-
Labels: pull-request-available  (was: )

> Enable cdc log for MOR table
> 
>
> Key: HUDI-7911
> URL: https://issues.apache.org/jira/browse/HUDI-7911
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7920) Make Spark 3.5 the default build profile for Spark integration

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7920:
-
Labels: pull-request-available  (was: )

> Make Spark 3.5 the default build profile for Spark integration
> --
>
> Key: HUDI-7920
> URL: https://issues.apache.org/jira/browse/HUDI-7920
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, Spark 3.2 is the default build profile.  Given that Spark 3.2 is no 
> longer actively maintained (the latest Spark 3.2.x release is from April 2023), 
> we should move the default Spark build profile to Spark 3.5 to keep supporting 
> the latest Spark release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7914) Incorrect schema produced in DELETE_PARTITION replacecommit

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7914:
-
Labels: pull-request-available  (was: )

> Incorrect schema produced in DELETE_PARTITION replacecommit
> ---
>
> Key: HUDI-7914
> URL: https://issues.apache.org/jira/browse/HUDI-7914
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vitali Makarevich
>Priority: Major
>  Labels: pull-request-available
>
> In the current code, delete_partitions produces a {{replacecommit}} whose 
> schema includes internal fields such as {{{}_hoodie_file_name{}}}, while e.g. a 
> normal {{commit}} produces a schema without such fields.
> This leads to unexpected behavior when the {{replacecommit}} is the latest on 
> the timeline,
> e.g. [#10258|https://github.com/apache/hudi/issues/10258]
> [#10533|https://github.com/apache/hudi/issues/10533]
> Metadata sync, or any other subsequent write, will then pick up the incorrect 
> schema: in the best case it fails because fields are duplicated, in the worst 
> case it can lead to data loss.
> The problem was introduced in [https://github.com/apache/hudi/pull/5610/files].
> The fix applies the same approach already used for other operations such as 
> {{delete}}.
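> A minimal sketch of the expected fix (assuming HoodieAvroUtils.removeMetadataFields 
> is the right utility; where exactly this hooks into the write path is 
> illustrative):
> {code:java}
> import org.apache.avro.Schema
> import org.apache.hudi.avro.HoodieAvroUtils
>
> // Strip the _hoodie_* meta fields before recording the schema in the
> // replacecommit metadata, matching what a normal commit records.
> def schemaForCommitMetadata(writeSchema: Schema): String =
>   HoodieAvroUtils.removeMetadataFields(writeSchema).toString
> {code}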



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7909) Add Comment to the FieldSchema returned by Aws Glue Client

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7909:
-
Labels: pull-request-available  (was: )

> Add Comment to the FieldSchema returned by Aws Glue Client 
> ---
>
> Key: HUDI-7909
> URL: https://issues.apache.org/jira/browse/HUDI-7909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> The implementation of getMetastoreFieldSchema in AwsGlueCatalogSyncClient 
> doesn't include the comment as part of the FieldSchema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7906) improve the parallelism deduce in rdd write

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7906:
-
Labels: pull-request-available  (was: )

> improve the parallelism deduce in rdd write
> ---
>
> Key: HUDI-7906
> URL: https://issues.apache.org/jira/browse/HUDI-7906
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11274] and 
> [https://github.com/apache/hudi/pull/11463] describe, there are two problems:
>  # if the RDD is an input RDD without a shuffle, the partition number can be 
> far too big or too small;
>  # users cannot control it easily:
>  ## in some cases the user can change it via `spark.default.parallelism`;
>  ## in other cases they cannot, because the value is hard-coded;
>  ## the better way, as elsewhere in Spark, is to let `spark.default.parallelism` 
> or `spark.sql.shuffle.partitions` control it, with Hudi-specific configs taking 
> precedence.
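> A hedged sketch of the deduction order described above (the helper name is 
> illustrative, not the proposed API):
> {code:java}
> import org.apache.spark.sql.DataFrame
>
> // Deduce write parallelism: an explicit Hudi config wins, then the standard
> // Spark knobs, and only then the incoming RDD's partitioning.
> def deduceParallelism(df: DataFrame, configured: Int): Int =
>   if (configured > 0) configured
>   else {
>     val sqlShuffle = df.sparkSession.conf.get("spark.sql.shuffle.partitions", "0").toInt
>     if (sqlShuffle > 0) sqlShuffle else df.rdd.getNumPartitions
>   }
> {code}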



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7877:
-
Labels: pull-request-available  (was: )

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in the index lookup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7892) Building workload support set parallelism

2024-06-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7892:
-
Labels: pull-request-available  (was: )

> Building workload support set parallelism
> -
>
> Key: HUDI-7892
> URL: https://issues.apache.org/jira/browse/HUDI-7892
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>
> Building workload support set parallelism



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7891) Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type

2024-06-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7891:
-
Labels: pull-request-available  (was: )

> Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type
> --
>
> Key: HUDI-7891
> URL: https://issues.apache.org/jira/browse/HUDI-7891
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7881) Handle table base path changes in meta syncs.

2024-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7881:
-
Labels: pull-request-available  (was: )

> Handle table base path changes in meta syncs.
> -
>
> Key: HUDI-7881
> URL: https://issues.apache.org/jira/browse/HUDI-7881
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, meta-sync
>Reporter: Vinish Reddy
>Assignee: Vinish Reddy
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7880) Support extraMetadata in Spark SQL Insert Into

2024-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7880:
-
Labels: pull-request-available  (was: )

> Support extraMetadata in Spark SQL Insert Into
> --
>
> Key: HUDI-7880
> URL: https://issues.apache.org/jira/browse/HUDI-7880
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Users want to implement checkpoints similar to those in Hudi DeltaStreamer. 
> DeltaStreamer does this by saving values into the extraMetadata of a commit 
> file under the key deltastreamer.checkpoint.key. We can achieve this with the 
> Spark client by configuring the parameter 
> `hoodie.datasource.write.commitmeta.key.prefix`, but in Spark SQL the 
> configuration parameters are restricted to keys prefixed with `hoodie.`
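> A hedged illustration of the two paths (the prefix key above is assumed to be 
> the one the garbled original referred to; table name and checkpoint value are 
> examples):
> {code:java}
> // Datasource path (works today): keys matching the prefix are copied into
> // the commit's extraMetadata.
> df.write.format("hudi")
>   .option("hoodie.table.name", "t")
>   .option("hoodie.datasource.write.commitmeta.key.prefix", "deltastreamer.")
>   .option("deltastreamer.checkpoint.key", "offset-12345")
>   .mode("append")
>   .save(tablePath)
> // Spark SQL path: blocked today, because SQL only accepts options prefixed
> // with "hoodie." - which is the restriction this ticket wants relaxed.
> {code}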



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7879) Optimize the redundant creation of HoodieTable in DataSourceInternalWriterHelper and the unnecessary parameters in createTable within BaseHoodieWriteClient.

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7879:
-
Labels: pull-request-available  (was: )

> Optimize the redundant creation of HoodieTable in 
> DataSourceInternalWriterHelper and the unnecessary parameters in createTable 
> within BaseHoodieWriteClient.
> 
>
> Key: HUDI-7879
> URL: https://issues.apache.org/jira/browse/HUDI-7879
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> In the initialization method of DataSourceInternalWriterHelper, we currently 
> create two identical HoodieTable instances; we should remove one of them. 
> Also, while comparing the two HoodieTable instances, I noticed that the 
> createTable method in BaseHoodieWriteClient takes a Hadoop Configuration 
> parameter that isn't used by any implementation. I'm not sure why it was 
> designed this way, but I think we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7876) Use TypedProperties to store the spillable map configs for the FG reader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7876:
-
Labels: pull-request-available  (was: )

> Use TypedProperties to store the spillable map configs for the FG reader
> 
>
> Key: HUDI-7876
> URL: https://issues.apache.org/jira/browse/HUDI-7876
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> These configs take up 4 params on the fg reader that can simply be stored in 
> the TypedProperties that is already passed in.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7874) Fail to read 2-level structure Parquet

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7874:
-
Labels: pull-request-available  (was: )

> Fail to read 2-level structure Parquet
> --
>
> Key: HUDI-7874
> URL: https://issues.apache.org/jira/browse/HUDI-7874
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vitali Makarevich
>Priority: Major
>  Labels: pull-request-available
>
> If {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} is 
> explicitly set - the only way to be able to write nulls inside arrays - Hudi 
> starts to write Parquet files with the following schema inside:
> {code:java}
> required group internal_list (LIST) {
>   repeated group list {
>     required int64 element;
>   }
> }
> {code}
> But files produced before setting 
> {{{}"spark.hadoop.parquet.avro.write-old-list-structure", "false"{}}} have the 
> following schema inside:
> {code:java}
> required group internal_list (LIST) {
>   repeated int64 array;
> }
> {code}
> And Hudi 0.14.x fails to read records from such a file, failing with the 
> exception
> {{Caused by: java.lang.RuntimeException: Null-value for required field: }}
> even though the contents of the arrays are {{{}not null{}}} (they cannot be 
> null, in fact, since Avro requires 
> {{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write 
> {{{}null{}}}s).
> h3. Expected behavior
> Taken from Hudi 0.12.1 (not sure what exactly broke this):
>  # If I have a file with the 2-level structure and an update arrives (no 
> matter whether it has nulls inside the array or not - both behave the same) 
> with "spark.hadoop.parquet.avro.write-old-list-structure", "false" - overwrite 
> it into the 3-level structure ({*}fails in 0.14.1{*}).
>  # If I have the 3-level structure with nulls and an update comes (with or 
> without nulls) - read and write correctly.
> A simple reproduction of the issue can be found here:
> [https://github.com/VitoMakarevich/hudi-issue-014]
> Highly likely, the problem appeared after Hudi changed how values from the 
> Hadoop conf propagate into the reader instance (likely they were not 
> propagated before).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7875) Remove tablePath from HoodieFileGroupReader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7875:
-
Labels: pull-request-available  (was: )

> Remove tablePath from HoodieFileGroupReader
> ---
>
> Key: HUDI-7875
> URL: https://issues.apache.org/jira/browse/HUDI-7875
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> tablePath is already stored in the metaclient, which is itself a param.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7873) Remove getStorage method from HoodieReaderContext

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7873:
-
Labels: pull-request-available  (was: )

> Remove getStorage method from HoodieReaderContext
> -
>
> Key: HUDI-7873
> URL: https://issues.apache.org/jira/browse/HUDI-7873
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> All implementations of the method were the same, and it was only used by a 
> test method, because storage is passed as a param to the fg reader.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7872) Recreate Glue table on certain types of exceptions

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7872:
-
Labels: pull-request-available  (was: )

> Recreate Glue table on certain types of exceptions
> --
>
> Key: HUDI-7872
> URL: https://issues.apache.org/jira/browse/HUDI-7872
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> On certain types of exceptions (schema changes, inability to add partitions), 
> re-create the Glue table so that the table remains queryable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7871) Remove tableconfig from HoodieFilegroupReader params

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7871:
-
Labels: pull-request-available  (was: )

> Remove tableconfig from HoodieFilegroupReader params
> 
>
> Key: HUDI-7871
> URL: https://issues.apache.org/jira/browse/HUDI-7871
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> In production usage we always get the table configs from the metaclient 
> anyway. The constructor has too many params, so getting rid of one is useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7867) Data duplication caused by a drawback in deleting invalid files before commit

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7867:
-
Labels: pull-request-available  (was: )

> Data duplication caused by a drawback in deleting invalid files before 
> commit
> ---
>
> Key: HUDI-7867
> URL: https://issues.apache.org/jira/browse/HUDI-7867
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
>
> A user complained that after their daily job, which writes to a Hudi COW 
> table, finished, the downstream reading jobs found many duplicate records. 
> The daily job has been running in production for a long time, and this is the 
> first time it produced such a wrong result.
> He gave a concrete duplicated record as an example to help debug. The record 
> appeared in 3 base files belonging to different file groups.
> [screenshot: the duplicate record shown in three base files]
> In today's writer job, the Spark application finished successfully.
> In the driver log, two of those files are marked as invalid files to delete; 
> only one file is a valid file.
> [screenshot: driver log marking the two files as invalid]
> And in the clean-stage task log, those two files are also marked to be 
> deleted, and there is no exception in the task either.
> [screenshot: clean task log (truncated)]

[jira] [Updated] (HUDI-7838) Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and AbstractHoodieLogRecordReader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7838:
-
Labels: pull-request-available  (was: )

> Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and  
> AbstractHoodieLogRecordReader
> ---
>
> Key: HUDI-7838
> URL: https://issues.apache.org/jira/browse/HUDI-7838
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Jonathan Vexler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> hoodie.schema.cache.enable should be used to decide whether to use the schema 
> cache. Currently it is hardcoded to false.
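> A minimal sketch of the intended wiring (config key from the summary; how the 
> flag reaches the readers is illustrative):
> {code:java}
> import org.apache.hudi.common.config.TypedProperties
>
> val props = new TypedProperties()
> // Honor the user-facing flag instead of the current hardcoded `false`:
> val useSchemaCache = props.getBoolean("hoodie.schema.cache.enable", false)
> {code}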



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7671) Make Hudi timeline backward compatible

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7671:
-
Labels: compatibility pull-request-available  (was: compatibility)

> Make Hudi timeline backward compatible
> --
>
> Key: HUDI-7671
> URL: https://issues.apache.org/jira/browse/HUDI-7671
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: compatibility, pull-request-available
> Fix For: 1.0.0
>
>
> Since release 1.x, the timeline metadata file name has changed to include the 
> completion time; we need to keep compatibility with the 0.x branches/releases.
> 0.x meta file name pattern: ${instant_time}.action[.state]
> 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]
> In the 1.x release, while deciphering a Hudi instant from the metadata files, 
> if there is no completion time, the file modification time is used as the 
> completion time instead.
> The modification time follows the OCC concurrency-control semantics as long 
> as the files were not moved around.
> Caution: if the table is a MOR table and the files were moved in the past 
> from an old folder to the current folder, the reader view may represent a 
> wrong result set, because the completion times are then identical for all the 
> alive instants.
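> A hedged sketch of the deciphering rule (file-name layout from the patterns 
> above; the modification time is passed in so the sketch stays storage-agnostic):
> {code:java}
> // Derive the completion time for a timeline meta file name; for 0.x files
> // (no "_completion" part) fall back to the file's modification time.
> def completionTime(metaFileName: String, modificationTimeMs: Long): String = {
>   val instantPart = metaFileName.split('.')(0) // "${instant}" or "${instant}_${completion}"
>   instantPart.split('_') match {
>     case Array(_, completion) => completion    // 1.x: encoded in the name
>     case _ => modificationTimeMs.toString      // 0.x: use mtime (OCC semantics)
>   }
> }
> {code}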



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7869) Ensure properties are copied when modifying schema

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7869:
-
Labels: pull-request-available  (was: )

> Ensure properties are copied when modifying schema
> --
>
> Key: HUDI-7869
> URL: https://issues.apache.org/jira/browse/HUDI-7869
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Properties are not always copied when we modify the schema, such as removing 
> fields.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7779:
-
Labels: pull-request-available  (was: )

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline can lead to data consistency 
> issues on rare occasions. We should come up with proper guards to ensure we 
> never make such unintended archivals.
>  
> The major gap we want to guard is:
> if someone disabled the cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival stops at the earliest 
> commit to retain based on the latest clean commit metadata. But there are a 
> few other scenarios that need to be accounted for.
>  
> a. Keeping aside replace commits, lets dive into specifics for regular 
> commits and delta commits.
> Say the user configured the cleaner to retain 4 commits and the archival 
> configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
> versions created at or before t6. Say the cleaner did not run (for whatever 
> reason) for the next 5 commits.
>     Archival will certainly be guarded until the earliest commit to retain 
> based on the latest clean commit.
> Corner case to consider: 
> A savepoint was added at, say, t3 and later removed, and the cleaner was never 
> re-enabled. Even though archival would have stopped at t3 while the savepoint 
> was present, once the savepoint is removed, an archival run could archive 
> commit t3 - even though the file versions tracked at t3 have not yet been 
> cleaned by the cleaner.
> Reasoning: 
> We are good here w.r.t. data consistency. Until the cleaner runs next, these 
> older file versions might be exposed to the end user. But time travel queries 
> are not intended for already-cleaned-up commits, so this is not an issue. 
> None of the snapshot, time travel, or incremental queries will run into 
> trouble, as they are not supposed to poll for t3.
> If the cleaner is re-enabled at any later point, it will take care of 
> cleaning up the file versions tracked at commit t3; in the interim, some 
> older file versions might still be exposed to readers.
>  
> b. The trickier part is when replace commits are involved. Since the replace 
> commit metadata in the active timeline is what ensures the replaced file 
> groups are ignored by reads, the cleaner is expected to clean them up fully 
> before that metadata is archived. But are there chances this could go wrong? 
> Corner case to consider: let's add onto the above scenario, where t3 has a 
> savepoint and t4 is a replace commit that replaced file groups tracked in t3.
> The cleaner will skip cleaning up the files tracked by t3 (due to the 
> presence of the savepoint) but will clean up t4, t5, and t6, so the earliest 
> commit to retain points to t6. Now say the savepoint for t3 is removed while 
> the cleaner is disabled. In this state of the timeline, if archival is 
> executed (since t3.savepoint is removed), it might archive t3 and t4.rc. 
> This could lead to data duplicates, as both the replaced file groups and the 
> new file groups from t4.rc would be exposed as valid file groups.
>  
> In other words, to summarize the different scenarios: 
> i. The replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
> ii. The replaced file group is cleaned up. 
>     - ECTR is > this.rc, and it is safe to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to a savepoint the full 
> cleanup did not happen. After the savepoint is removed, when archival is 
> executed, we must avoid archiving the rc of interest. This is the gap we do 
> not account for as of now (see the sketch below).
>  
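> To make the three cases concrete, a compact guard sketch (names and timestamp 
> comparison are illustrative, not the actual archival code):
> {code:java}
> // case ii passes both checks; case iii fails the cleaned-up check
> // and the replace commit must be kept in the active timeline.
> def safeToArchive(rcTs: String, ectrTs: String, replacedGroupsCleaned: Boolean): Boolean =
>   rcTs < ectrTs && replacedGroupsCleaned
> {code}
> 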
> We have 3 options to go about to solve this.
> Option A: 
> Let the savepoint deletion flow take care of cleaning up the files it is 
> tracking. 
> Cons:
> A savepoint's responsibility is not to remove data files, so from a 
> single-responsibility standpoint this may not be right. Also, this cleanup 
> might need to do what a clean planner already does, i.e. build the file 
> system view, understand if it is supposed to be cleaned up already,

[jira] [Updated] (HUDI-7847) Infer record merge mode during table upgrade

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7847:
-
Labels: pull-request-available  (was: )

> Infer record merge mode during table upgrade
> 
>
> Key: HUDI-7847
> URL: https://issues.apache.org/jira/browse/HUDI-7847
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The record merge mode dictates the merging behavior in release 1.x, playing 
> the same role as the payload class config did in release 0.x.  During table 
> upgrade, we need to infer the record merge mode from the payload class so it 
> is correctly set.
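> A hedged sketch of the inference (the payload-class-to-mode mapping below is 
> an assumption based on the documented 1.x semantics, not the final upgrade 
> code):
> {code:java}
> // Map 0.x payload classes to 1.x merge modes; anything custom stays
> // payload-driven.
> def inferRecordMergeMode(payloadClass: String): String = payloadClass match {
>   case "org.apache.hudi.common.model.DefaultHoodieRecordPayload"     => "EVENT_TIME_ORDERING"
>   case "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload" => "COMMIT_TIME_ORDERING"
>   case _                                                             => "CUSTOM"
> }
> {code}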



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7841:
-
Labels: pull-request-available  (was: )

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it obtains those candidate files 
> by iterating over all files from the file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use `prunedPartitionsAndFileSlices` to consider only the 
> pruned partitions whenever there is a partition predicate.
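> A hedged sketch of the proposed narrowing (types simplified to plain 
> collections; the real `prunedPartitionsAndFileSlices` comes from the file 
> index as noted above):
> {code:java}
> // Only look inside partitions that survived partition pruning, then apply
> // the RLI file hits.
> def candidateFiles(prunedPartitionsToFiles: Map[String, Seq[String]],
>                    rliMatchedFiles: Set[String]): Seq[String] =
>   prunedPartitionsToFiles.values.flatten.filter(rliMatchedFiles.contains).toSeq
> {code}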



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7855) Add ability to dynamically configure write parallelism for BULK_INSERT for HoodieStreamer

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7855:
-
Labels: pull-request-available  (was: )

> Add ability to dynamically configure write parallelism for BULK_INSERT for 
> HoodieStreamer
> -
>
> Key: HUDI-7855
> URL: https://issues.apache.org/jira/browse/HUDI-7855
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
>
> Add the ability to dynamically configure write parallelism for BULK_INSERT in 
> HoodieStreamer. Currently, BULK_INSERT parallelism is configured based on the 
> source parallelism, which may be too aggressive or too conservative depending 
> on other factors, e.g. the partitions written to.
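> For reference, the existing knob a user would tune (the value is an example; 
> the ask here is to let HoodieStreamer set it dynamically):
> {code:java}
> // e.g. what a user passes today via --hoodie-conf to HoodieStreamer
> val bulkInsertParallelism = Map("hoodie.bulkinsert.shuffle.parallelism" -> "400")
> {code}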



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7854) Bump AWS SDK v2 version to 2.25.69

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7854:
-
Labels: pull-request-available  (was: )

> Bump AWS SDK v2 version to 2.25.69
> --
>
> Key: HUDI-7854
> URL: https://issues.apache.org/jira/browse/HUDI-7854
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> The current version of AWS SDK v2 used is 2.18.40 which is 1.5 years old.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7853) Fix missing serDe properties post migration from hiveSync to glueSync

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7853:
-
Labels: pull-request-available  (was: )

> Fix missing serDe properties post migration from hiveSync to glueSync
> -
>
> Key: HUDI-7853
> URL: https://issues.apache.org/jira/browse/HUDI-7853
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prathit Malik
>Assignee: Prathit Malik
>Priority: Major
>  Labels: pull-request-available
>
> More info : [https://github.com/apache/hudi/issues/11397]
>  
> After migration to 0.13.1, the Hudi table path is missing from the serde 
> properties, so reading from Spark throws the error below:
> - org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
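> A hedged illustration of a manual workaround until the sync writes the 
> property back (database/table/path names are examples):
> {code:java}
> spark.sql(
>   """ALTER TABLE my_db.my_hudi_table
>     |SET SERDEPROPERTIES ('path' = 's3://bucket/path/to/table')""".stripMargin)
> {code}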



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7852) Constrain the comparison of different types of ordering values to limited cases

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7852:
-
Labels: pull-request-available  (was: )

> Constrain the comparison of different types of ordering values to limited 
> cases
> ---
>
> Key: HUDI-7852
> URL: https://issues.apache.org/jira/browse/HUDI-7852
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> HoodieBaseFileGroupRecordBuffer#compareTo compares the numbers by casting 
> them to long values, which is not safe for Float and Double.  We should limit 
> the allowed cases to avoid wrong results.
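> A small demonstration of why the long cast is unsafe (plain Scala):
> {code:java}
> val a: Double = 1.9
> val b: Double = 1.2
> // Casting first loses the fraction: both become 1L and compare as equal.
> assert(java.lang.Long.compare(a.toLong, b.toLong) == 0)
> // Comparing as doubles preserves the true ordering.
> assert(java.lang.Double.compare(a, b) > 0)
> {code}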



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7849) Reduce time spent on running testFiltersInFileFormat

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7849:
-
Labels: pull-request-available  (was: )

> Reduce time spent on running testFiltersInFileFormat
> 
>
> Key: HUDI-7849
> URL: https://issues.apache.org/jira/browse/HUDI-7849
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Below are the top long-running tests in the job "UT flink & FT common & 
> flink & spark-client & hudi-spark" in Azure CI.  The time spent running 
> testFiltersInFileFormat should be reduced.
> {code:java}
> /usr/bin/bash --noprofile --norc 
> /home/vsts/work/_temp/4fa77791-00bc-40cc-82d7-1fb635914a0f.sh
> grep: */target/surefire-reports/*.xml: No such file or directory
> 366.474 boolean) [2] false(testFiltersInFileFormat
> 223.221 boolean) [1] true(testFiltersInFileFormat
> 80.903 HoodieTableType, Integer) [3] MERGE_ON_READ, 2(testNewParquetFileFormat
> 65.48 boolean) [2] true(testDeletePartitionAndArchive
> 56.558 boolean) [1] false(testDeletePartitionAndArchive{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7851) Fix java doc of DeltaWriteProfile

2024-06-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7851:
-
Labels: pull-request-available  (was: )

> Fix java doc of DeltaWriteProfile
> -
>
> Key: HUDI-7851
> URL: https://issues.apache.org/jira/browse/HUDI-7851
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7846) Bump apache-rat-plugin to 0.16.1 to eliminate thread-safe warning in maven parallel build

2024-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7846:
-
Labels: pull-request-available  (was: )

> Bump apache-rat-plugin to 0.16.1 to eliminate thread-safe warning in maven 
> parallel build
> -
>
> Key: HUDI-7846
> URL: https://issues.apache.org/jira/browse/HUDI-7846
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> The following warnings are emitted when doing a Maven parallel build with 
> `mvn -T 1C ...`
> {code:java}
> [WARNING] Enable debug to see precisely which goals are not marked as 
> thread-safe.
> [WARNING] *
> [WARNING] * Your build is requesting parallel execution, but this         *
> [WARNING] * project contains the following plugin(s) that have goals not  *
> [WARNING] * marked as thread-safe to support parallel execution.          *
> [WARNING] * While this /may/ work fine, please look for plugin updates    *
> [WARNING] * and/or request plugins be made thread-safe.                   *
> [WARNING] * If reporting an issue, report it against the plugin in        *
> [WARNING] * question, not against Apache Maven.                           *
> [WARNING] *
> [WARNING] The following plugins are not marked as thread-safe in 
> hudi-hadoop-mr:
> [WARNING]   org.apache.rat:apache-rat-plugin:0.13 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7845) Call show_fsview_latest Procedure support path_regex

2024-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7845:
-
Labels: pull-request-available  (was: )

> Call show_fsview_latest Procedure support path_regex
> 
>
> Key: HUDI-7845
> URL: https://issues.apache.org/jira/browse/HUDI-7845
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xinyu Zou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7844:
-
Labels: pull-request-available  (was: )

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR ([https://github.com/apache/hudi/pull/11162]) introduced changes that 
> make `HoodieSparkSqlTestBase` swallow test failures.
>  
> !Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7390) [Regression] HoodieStreamer no longer works without --props being supplied

2024-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7390:
-
Labels: pull-request-available  (was: )

> [Regression] HoodieStreamer no longer works without --props being supplied
> --
>
> Key: HUDI-7390
> URL: https://issues.apache.org/jira/browse/HUDI-7390
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Brandon Dahler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: spark.log
>
>
> h2. Problem
> When attempting to run HoodieStreamer without a props file, specifying all 
> required extra configuration via {{--hoodie-conf}} parameters, the execution 
> fails and an exception is thrown:
> {code:java}
> 24/02/06 22:15:13 INFO SparkContext: Successfully stopped SparkContext
> Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Cannot read properties from dfs from file 
> file:/private/tmp/hudi-props-repro/src/test/resources/streamer-config/dfs-source.properties
>         at 
> org.apache.hudi.common.config.DFSPropertiesConfiguration.addPropsFromFile(DFSPropertiesConfiguration.java:166)
>         at 
> org.apache.hudi.common.config.DFSPropertiesConfiguration.(DFSPropertiesConfiguration.java:85)
>         at 
> org.apache.hudi.utilities.UtilHelpers.readConfig(UtilHelpers.java:232)
>         at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$Config.getProps(HoodieStreamer.java:437)
>         at 
> org.apache.hudi.utilities.streamer.StreamSync.getDeducedSchemaProvider(StreamSync.java:656)
>         at 
> org.apache.hudi.utilities.streamer.StreamSync.fetchNextBatchFromSource(StreamSync.java:632)
>         at 
> org.apache.hudi.utilities.streamer.StreamSync.fetchFromSourceAndPrepareRecords(StreamSync.java:525)
>         at 
> org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:498)
>         at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:404)
>         at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>         at 
> org.apache.hudi.utilities.ingestion.HoodieIngestionService.startIngestion(HoodieIngestionService.java:72)
>         at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>         at 
> org.apache.hudi.utilities.streamer.HoodieStreamer.sync(HoodieStreamer.java:207)
>         at 
> org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:592)
>         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>         at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>         at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>         at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
>         at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
>         at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
>         at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
>         at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.FileNotFoundException: File 
> file:/private/tmp/hudi-props-repro/src/test/resources/streamer-config/dfs-source.properties
>  does not exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
>         at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:160)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
>    

[jira] [Updated] (HUDI-7840) Add position merging back to file group reader

2024-06-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7840:
-
Labels: pull-request-available  (was: )

> Add position merging back to file group reader
> --
>
> Key: HUDI-7840
> URL: https://issues.apache.org/jira/browse/HUDI-7840
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Position merging was removed to allow changes to the fg reader; it will now 
> be added back with a proper fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7834) Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions

2024-06-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7834:
-
Labels: pull-request-available  (was: )

> Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions
> ---
>
> Key: HUDI-7834
> URL: https://issues.apache.org/jira/browse/HUDI-7834
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7830) Use predicate when calculating snapshot checkpoints.

2024-06-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7830:
-
Labels: pull-request-available source  (was: source)

> Use predicate when calculating snapshot checkpoints.
> 
>
> Key: HUDI-7830
> URL: https://issues.apache.org/jira/browse/HUDI-7830
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinish Reddy
>Assignee: Vinish Reddy
>Priority: Minor
>  Labels: pull-request-available, source
>
> Currently only startInstant and endInstant are calculated for snapshot 
> checkpoints; we should also include a filter predicate, which can be used to 
> prune data effectively.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7414:
-
Labels: pull-request-available  (was: )

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: nadine
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> There was a Jira issue filed where sarfaraz wanted to know more about 
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property defined: 
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> but it's not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} 
> config is superfluous: it is set but never read.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7828) Support Flink 1.18.1

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7828:
-
Labels: pull-request-available  (was: )

> Support Flink 1.18.1
> 
>
> Key: HUDI-7828
> URL: https://issues.apache.org/jira/browse/HUDI-7828
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Hudi currently supports Flink 1.18.0; we need to bump the Flink 1.18 version 
> to 1.18.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7782) Task not serializable due to DynamoDBBasedLockProvider and HiveMetastoreBasedLockProvider in clean action

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7782:
-
Labels: pull-request-available  (was: )

> Task not serializable due to DynamoDBBasedLockProvider and 
> HiveMetastoreBasedLockProvider in clean action
> -
>
> Key: HUDI-7782
> URL: https://issues.apache.org/jira/browse/HUDI-7782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: hector
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Caused by: java.io.NotSerializableException: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
> Serialization stack:
>  - object not serializable (class: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider, value: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider@1355d2ca)
>  
> HUDI-3638 fixed the same issue, but only for ZookeeperBasedLockProvider.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch dependabot/maven/io.airlift-aircompressor-0.27 deleted (was 5042e73eb65)

2024-06-03 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/io.airlift-aircompressor-0.27
in repository https://gitbox.apache.org/repos/asf/hudi.git


 was 5042e73eb65 Bump io.airlift:aircompressor from 0.25 to 0.27

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.



[jira] [Updated] (HUDI-7747) In MetaClient remove getBasePathV2() and return StoragePath from getBasePath()

2024-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7747:
-
Labels: pull-request-available  (was: )

> In MetaClient remove getBasePathV2() and return StoragePath from getBasePath()
> --
>
> Key: HUDI-7747
> URL: https://issues.apache.org/jira/browse/HUDI-7747
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> In HoodieTableMetaClient remove getBasePathV2() and return StoragePath from 
> getBasePath().
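
A hedged before/after sketch of the proposed signature change (simplified; 
the real class has many more members):

{code:java}
import org.apache.hudi.storage.StoragePath;

// Before: two accessors for the same value.
//   String getBasePath();
//   StoragePath getBasePathV2();
// After: a single, typed accessor.
public class MetaClientSketch {
  private final StoragePath basePath;

  public MetaClientSketch(StoragePath basePath) {
    this.basePath = basePath;
  }

  public StoragePath getBasePath() {
    return basePath;
  }
  // Callers that need the raw string can use getBasePath().toString().
}
{code}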



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7826) hoodie.write.set.null.for.missing.columns results in invalid objects

2024-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7826:
-
Labels: pull-request-available  (was: )

> hoodie.write.set.null.for.missing.columns results in invalid objects
> 
>
> Key: HUDI-7826
> URL: https://issues.apache.org/jira/browse/HUDI-7826
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> When `hoodie.write.set.null.for.missing.columns` is set, a null value is 
> written for the fields that are missing in the incoming data set. If the 
> column is non-nullable, this produces an error at runtime. Instead, we 
> should evolve the field to be nullable in the table's schema.
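
A small Avro sketch of the proposed behavior (illustrative only, not the 
Hudi write path): instead of forcing null into a required field, the field's 
schema is evolved into a nullable union with a null default:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class NullableEvolutionSketch {
  public static void main(String[] args) {
    // Table schema: "city" is required, but missing from the incoming batch.
    Schema nonNullable = SchemaBuilder.record("rec").fields()
        .requiredString("city")
        .endRecord();

    // Proposed evolution: make the field a nullable union defaulting to
    // null, so a missing column yields a valid record instead of an error.
    Schema evolved = SchemaBuilder.record("rec").fields()
        .name("city").type().unionOf().nullType().and().stringType().endUnion()
        .nullDefault()
        .endRecord();

    System.out.println(nonNullable.getField("city").schema().isNullable()); // false
    System.out.println(evolved.getField("city").schema().isNullable());     // true
  }
}
{code}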



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch dependabot/maven/io.airlift-aircompressor-0.27 created (now 5042e73eb65)

2024-06-02 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/io.airlift-aircompressor-0.27
in repository https://gitbox.apache.org/repos/asf/hudi.git


  at 5042e73eb65 Bump io.airlift:aircompressor from 0.25 to 0.27

No new revisions were added by this update.



[jira] [Updated] (HUDI-7825) Support Report pending clustering and compaction plan metric

2024-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7825:
-
Labels: pull-request-available  (was: )

> Support Report pending clustering and compaction plan metric 
> -
>
> Key: HUDI-7825
> URL: https://issues.apache.org/jira/browse/HUDI-7825
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jack Lei
>Priority: Major
>  Labels: pull-request-available
>
> 1. When only async clustering or async compaction scheduling is enabled, 
> and clustering.async.enabled or compaction.async.enabled is set to false, 
> the Flink job will not add ClusteringPlanOperator or CompactionPlanOperator.
> 2. However, the pending-plan metric is only emitted in 
> ClusteringPlanOperator or CompactionPlanOperator.
> 3. So we should support emitting the pending-plan metric in 
> StreamWriteOperatorCoordinator as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7824:
-
Labels: pull-request-available  (was: )

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has already moved ahead with respect to the earliest 
> commit to retain, then when the savepoint is later removed, the cleaner 
> should account for cleaning up that commit.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.
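
A hypothetical sketch of the planning rule (invented helper names, not the 
actual Hudi planner code): once a savepoint that existed at the last clean 
has disappeared, the incremental partition delta can no longer be trusted, 
so the planner falls back to a full partition listing:

{code:java}
import java.util.List;
import java.util.Set;

public class IncrementalCleanSketch {
  static List<String> partitionsToClean(Set<String> savepointsAtLastClean,
                                        Set<String> savepointsNow,
                                        List<String> allPartitions,
                                        List<String> incrementalPartitions) {
    // A removed savepoint may unblock commits in partitions that the
    // incremental path never recorded, so scan everything in that case.
    boolean savepointRemoved =
        !savepointsNow.containsAll(savepointsAtLastClean);
    return savepointRemoved ? allPartitions : incrementalPartitions;
  }
}
{code}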



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7823:
-
Labels: pull-request-available  (was: )

> Simplify dependency management on exclusions
> 
>
> Key: HUDI-7823
> URL: https://issues.apache.org/jira/browse/HUDI-7823
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7822:
-
Labels: pull-request-available  (was: )

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7821:
-
Labels: pull-request-available  (was: )

> Handle schema evolution in proto to avro conversion
> ---
>
> Key: HUDI-7821
> URL: https://issues.apache.org/jira/browse/HUDI-7821
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> Users can encounter errors when a batch of data was written with an older 
> schema and the new schema has fields that are not present in the old data.
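
A self-contained Avro round-trip illustrating the failure mode and the 
standard remedy (illustrative only, not the proto conversion code itself): 
data encoded with the old schema decodes against the new schema only if the 
added field carries a default:

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ProtoEvolutionDemo {
  public static void main(String[] args) throws Exception {
    Schema oldSchema = SchemaBuilder.record("r").fields()
        .requiredString("id").endRecord();
    // New schema adds "tag"; the default lets old data be read without it.
    Schema newSchema = SchemaBuilder.record("r").fields()
        .requiredString("id")
        .name("tag").type().stringType().stringDefault("none")
        .endRecord();

    GenericRecord rec = new GenericData.Record(oldSchema);
    rec.put("id", "a1");

    // Encode with the old (writer) schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(rec, enc);
    enc.flush();

    // Decode with the new (reader) schema; "tag" is filled from its default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord evolved =
        new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, dec);
    System.out.println(evolved); // {"id": "a1", "tag": "none"}
  }
}
{code}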



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7819:
-
Labels: pull-request-available  (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7819
> URL: https://issues.apache.org/jira/browse/HUDI-7819
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7817:
-
Labels: pull-request-available  (was: )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is an older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core). 
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities and 
> should be avoided.
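
A before/after sketch of the migration (assuming straightforward call sites; 
real usages may need more care): the FasterXML mapper is a near drop-in 
replacement for the legacy Codehaus one:

{code:java}
// Before (vulnerable, unmaintained):
//   import org.codehaus.jackson.map.ObjectMapper;
// After:
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Collections;

public class JacksonMigrationDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Serialization and parsing keep the same entry points across the two
    // libraries, so most call sites only need the import swap.
    String json =
        mapper.writeValueAsString(Collections.singletonMap("key", "value"));
    System.out.println(json);                                      // {"key":"value"}
    System.out.println(mapper.readTree(json).get("key").asText()); // value
  }
}
{code}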



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7816) Pass the source profile to the snapshot query splitter

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7816:
-
Labels: pull-request-available  (was: )

> Pass the source profile to the snapshot query splitter
> --
>
> Key: HUDI-7816
> URL: https://issues.apache.org/jira/browse/HUDI-7816
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7815:
-
Labels: pull-request-available  (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh 
> timeline
> 
>
> Key: HUDI-7815
> URL: https://issues.apache.org/jira/browse/HUDI-7815
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

