[jira] [Closed] (HUDI-6153) Change the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-6153.

Fix Version/s: 0.14.0
   (was: 0.14.1)
   Resolution: Fixed

> Change the rollback mechanism for MDT to actual rollbacks rather than 
> appending revert blocks
> -
>
> Key: HUDI-6153
> URL: https://issues.apache.org/jira/browse/HUDI-6153
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When rolling back completed commits for indexes like the record index, the 
> list of all keys removed from the dataset is required. This information is not 
> available during rollback processing in the MDT, since the files have already 
> been deleted during the rollback inflight processing. 
> Hence, the current MDT rollback mechanism of adding revert (-files, -col_stats) 
> entries does not work for the record index.
> This PR changes the rollback mechanism to actually roll back deltacommits on 
> the MDT. This makes the rollback handling faster and keeps the MDT in sync 
> with the dataset.
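>
> A minimal sketch (not part of the original issue) of what "actually rolling 
> back" could look like, assuming a write client opened on the metadata table 
> (mdtWriteClient) and its timeline (mdtTimeline) are available:
> {code:java}
> import org.apache.hudi.client.SparkRDDWriteClient;
> import org.apache.hudi.common.table.timeline.HoodieTimeline;
>
> // Sketch: MDT deltacommits re-use the instant time of the dataset commit
> // they apply, so rolling back a dataset commit maps to rolling back the
> // MDT deltacommit with the same instant time.
> public class MdtRollbackSketch {
>   public static void rollbackMirroredDeltaCommit(SparkRDDWriteClient<?> mdtWriteClient,
>                                                  HoodieTimeline mdtTimeline,
>                                                  String dataTableInstant) {
>     if (mdtTimeline.containsInstant(dataTableInstant)) {
>       // Physical rollback of the deltacommit, instead of appending revert blocks.
>       mdtWriteClient.rollback(dataTableInstant);
>     }
>   }
> }
> {code}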



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1887) Make schema post processor's default as disabled

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1887:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Make schema post processor's default as disabled
> 
>
> Key: HUDI-1887
> URL: https://issues.apache.org/jira/browse/HUDI-1887
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, sev:high, triaged
> Fix For: 0.14.1
>
>
> With the default value [fix|https://github.com/apache/hudi/pull/2765] in 
> place, the schema post processor is no longer required to be enabled by default. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5552) Too slow while using trino-hudi connector while querying partitioned tables.

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5552:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Too slow while using trino-hudi connector while querying partitioned tables.
> 
>
> Key: HUDI-5552
> URL: https://issues.apache.org/jira/browse/HUDI-5552
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: trino-presto
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.1
>
>
> See the issue for details: [[SUPPORT] Too slow while using trino-hudi 
> connector while querying partitioned tables. · Issue #7643 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/7643]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4123:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
> 
>
> Key: HUDI-4123
> URL: https://issues.apache.org/jira/browse/HUDI-4123
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When using SqlSource:
> ## Create hive source table
> ```sql
> create database test location '/test';
> create table test.test_source (
>   id int,
>   name string,
>   price double,
>   dt string,
>   ts bigint
> );
> insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
> ```
> ## Use SqlSource
> sql_source.properties
> ```
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.partitionpath.field=dt
> hoodie.deltastreamer.source.sql.sql.query = select * from test.test_source
> hoodie.datasource.hive_sync.table=test_hudi_target
> hoodie.datasource.hive_sync.database=hudi
> hoodie.datasource.hive_sync.partition_fields=dt
> hoodie.datasource.hive_sync.create_managed_table = true
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> ```
> ```bash
> spark-submit --conf "spark.sql.catalogImplementation=hive" \
> --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 
> --executor-cores 2 --driver-memory 4G --driver-cores 2 \
> --principal spark/indata-10-110-105-163.indata@indata.com --keytab 
> /etc/security/keytabs/spark.service.keytab \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  \
> --props file:///opt/sql_source.properties  \
> --target-base-path /hudi/test_hudi_target \
> --target-table test_hudi_target \
> --op BULK_INSERT \
> --table-type COPY_ON_WRITE \
> --source-ordering-field ts \
> --source-class org.apache.hudi.utilities.sources.SqlSource \
> --enable-sync  \
> --checkpoint earliest \
> --allow-commit-on-no-checkpoint-change
> ```
> Once executed, the hive source table is successfully written to the Hudi 
> target table.
> However, if it is executed multiple times, such as on a second run, an 
> exception is thrown:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :
> "deltastreamer.checkpoint.reset_key" : "earliest"
> ```
> The reason is that the value of `deltastreamer.checkpoint.reset_key` is 
> `earliest`, but `deltastreamer.checkpoint.key` is null. According to the logic 
> of the method `getCheckpointToResume`, this combination throws the exception.
> I think that, since the checkpoint returned by SqlSource is null, the value of 
> `deltastreamer.checkpoint.key` should also be saved as null. This also avoids 
> the exception, according to the logic of the method `getCheckpointToResume`.
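>
> A minimal sketch (not the actual patch; names simplified) of the fix proposed 
> above: keep the two checkpoint keys consistent when the source checkpoint is 
> null:
> ```java
> import java.util.HashMap;
> import java.util.Map;
>
> // Sketch only: mirrors the step where DeltaSync writes checkpoint metadata
> // into the commit. When the source (e.g. SqlSource) yields a null checkpoint,
> // neither key is persisted, so getCheckpointToResume() sees a consistent state.
> public class NullCheckpointSketch {
>   static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
>   static final String CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key";
>
>   static Map<String, String> checkpointMetadata(String checkpoint, String resetCheckpoint) {
>     Map<String, String> extraMetadata = new HashMap<>();
>     if (checkpoint != null) {
>       extraMetadata.put(CHECKPOINT_KEY, checkpoint);
>       if (resetCheckpoint != null) {
>         extraMetadata.put(CHECKPOINT_RESET_KEY, resetCheckpoint);
>       }
>     }
>     return extraMetadata;
>   }
> }
> ```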
>  
>  
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :Option{val=[20220519162403646__commit__COMPLETED]}, 
> Instants :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
>   "partitionToWriteStats" : {
>     "2016/03/15" : [ {
>       "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",
>       "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 342,
>       "numDeletes" : 0,
>       "numUpdateWrites" : 0,
>       "numInserts" : 342,
>       "totalWriteBytes" : 481336,
>       "totalWriteErrors" : 0,
>       "tempPath" : null,
>       "partitionPath" : "2016/03/15",
>       "totalLogRecords" : 0,
>       "totalLogFilesCompacted" : 0,
>       "totalLogSizeCompacted" : 0,
>       "totalUpdatedRecordsCompacted" : 0,
>       "totalLogBlocks" : 0,
>       "totalCorruptLogBlock" : 0,
>       "totalRollbackBlocks" : 0,
>       "fileSizeInBytes" : 481336,
>       "minEventTime" : null,
>       "maxEventTime" : null
>     } ],
>     "2015/03/16" : [ {
>       "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",
>       "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
>       "prevCommit" : "null",  

[jira] [Updated] (HUDI-5490) Investigate test failures w/ record level index for existing tests

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5490:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Investigate test failures w/ record level index for existing tests
> --
>
> Key: HUDI-5490
> URL: https://issues.apache.org/jira/browse/HUDI-5490
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Blocker
> Fix For: 0.14.1
>
>
> Enable the record level index for some of the chosen tests (30 to 40) and 
> ensure they succeed. The parameterized tests covered in this jira are:
> TestCOWDataSourceStorage, TestSparkDataSource, TestMORDataSourceStorage, 
> TestCOWDataSource#testDropInsertDup and 
> TestHoodieClientOnCopyOnWriteStorage (sub-tests listed below for 
> TestHoodieClientOnCopyOnWriteStorage)
> Auto commit tests
> testDeduplicationOnInsert
> testDeduplicationOnUpsert
> testInsertsWithHoodieConcatHandle
> testDeletes
> testUpsertsUpdatePartitionPathGlobalBloom
> testSmallInsertHandlingForUpserts
> testSmallInsertHandlingForInserts
> testDeletesWithDeleteApi



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3026) HoodieAppendhandle may result in duplicate key for hbase index

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3026:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> HoodieAppendhandle may result in duplicate key for hbase index
> --
>
> Key: HUDI-3026
> URL: https://issues.apache.org/jira/browse/HUDI-3026
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: ZiyueGuan
>Assignee: ZiyueGuan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Problem: the same key may occur in two file groups when the HBase index is 
> used. These two file groups will have the same FileID prefix. As the HBase 
> index is global, this is unexpected.
> How to reproduce:
> Take a table whose records are not sorted in Spark. Let's say we have five 
> records with keys 1,2,3,4,5 to write. They may be iterated in a different 
> order in each attempt. In the first task attempt (attempt 1), we write three 
> records 5,4,3 to fileID_1_log.1_attempt1, but this attempt fails. Spark 
> retries in a second task attempt (attempt 2), in which we write four records 
> 1,2,3,4 to fileID_1_log.1_attempt2. Then we find this file group is large 
> enough by calling canWrite, so Hudi writes record 5 to 
> fileID_2_log.1_attempt2 and finishes the commit.
> When we run compaction, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 
> are both compacted, so we finally get 543 + 1234 = 12345 in fileID_1 while we 
> also have 5 in fileID_2. Record 5 appears in two file groups.
> Reason: the marker file mechanism doesn't reconcile log files, as the code at 
> [https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553]
> shows, and log files are actually not fail-safe.
> I'm not sure if [~danny0405] has found this problem too, as I see that 
> FlinkAppendHandle had been made to always return true, but it was changed 
> back recently. 
> Solution:
> We may have a quick fix by making canWrite in HoodieAppendHandle always 
> return true. However, there may be a more elegant solution: use the append 
> result to generate the compaction plan rather than listing log files, which 
> gives more granular control at the log-block level instead of the log-file 
> level. 
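>
> A minimal sketch of the quick fix named above (illustrative only; the real 
> HoodieAppendHandle has more sizing context around this method):
> {code:java}
> import org.apache.hudi.common.model.HoodieRecord;
>
> // In HoodieAppendHandle, canWrite() currently consults a size estimate and
> // can roll over to a new file group mid-write; the quick fix pins it open:
> public class AppendHandleQuickFixSketch {
>   public boolean canWrite(HoodieRecord record) {
>     return true; // keep all records of one handle in a single file group
>   }
> }
> {code}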



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4701) Support bulk insert without primary key and precombine field

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4701:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support bulk insert without primary key and precombine field
> 
>
> Key: HUDI-4701
> URL: https://issues.apache.org/jira/browse/HUDI-4701
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4585:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Optimize query performance on Presto Hudi connector
> 
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6163) Add PR size labeler to Hudi repo

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6163:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add PR size labeler to Hudi repo
> 
>
> Key: HUDI-6163
> URL: https://issues.apache.org/jira/browse/HUDI-6163
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4294:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Introduce build action to actually perform index data generation
> 
>
> Key: HUDI-4294
> URL: https://issues.apache.org/jira/browse/HUDI-4294
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: shibei
>Assignee: shibei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> In this issue, we introduce a new action type called build to actually 
> perform index data generation. This action contains two steps, as the 
> clustering action does:
>  # Generate an action plan to clarify which files and which indexes need to 
> be built;
>  # Execute the index build according to the action plan generated by step one.
>  
> A Call procedure will be implemented as well.
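>
> A minimal sketch of the plan/execute split described above (names are 
> illustrative, not the real API):
> {code:java}
> // Mirrors how clustering separates scheduling from execution.
> public interface BuildActionSketch {
>   /** Step 1: generate a plan naming the files and indexes to build. */
>   BuildPlan scheduleBuild();
>
>   /** Step 2: generate index data according to the plan from step 1. */
>   BuildResult executeBuild(BuildPlan plan);
>
>   // Placeholder types for the sketch.
>   final class BuildPlan {}
>   final class BuildResult {}
> }
> {code}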



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6139) Add support for Transformer schema validation in DeltaStreamer

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6139:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add support for Transformer schema validation in DeltaStreamer
> --
>
> Key: HUDI-6139
> URL: https://issues.apache.org/jira/browse/HUDI-6139
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Add a new API in Transformer to provide the target schema after 
> transformation. The new API can then be used to validate whether the schema 
> of the transformed data matches the expected schema of the transformer.
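>
> A minimal sketch of what such an API could look like (hypothetical signature, 
> not the merged change):
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.hudi.common.util.Option;
>
> // A Transformer that can declare its output schema up front, letting
> // DeltaStreamer validate transformed rows against the expected target schema.
> public interface SchemaAwareTransformerSketch {
>   /** Schema of the transformed data, if it can be known before running. */
>   Option<Schema> transformedSchema(Schema sourceSchema);
> }
> {code}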



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5323) Decouple virtual key with writing bloom filters to parquet files

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5323:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Decouple virtual key with writing bloom filters to parquet files
> 
>
> Key: HUDI-5323
> URL: https://issues.apache.org/jira/browse/HUDI-5323
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When the virtual key feature is enabled by setting 
> hoodie.populate.meta.fields to false, bloom filters are not written to the 
> parquet base files in write transactions. The relevant logic is in the 
> HoodieFileWriterFactory class:
> {code:java}
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
>   return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>       taskContextSupplier, populateMetaFields, populateMetaFields);
> }
>
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
>   Option<BloomFilter> filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
>   HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);
>   HoodieParquetConfig<HoodieAvroWriteSupport> parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
>       config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
>       conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());
>   return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
> } {code}
> Given that the bloom filters are absent, when using the Bloom Index on the 
> same table, the writer encounters an NPE (HUDI-5319).
> We should decouple the virtual key feature from the bloom filter and always 
> write the bloom filters to the parquet files. 
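>
> A minimal sketch of the decoupling, assuming the same factory methods as in 
> the snippet above (the constant true is illustrative; the real fix may derive 
> it from the bloom filter/index configs):
> {code:java}
> private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
>     String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
>     TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
>   // Decoupled: bloom filter creation no longer rides on populateMetaFields.
>   boolean enableBloomFilter = true;
>   return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
>       taskContextSupplier, populateMetaFields, enableBloomFilter);
> }
> {code}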



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2754) Performance improvement for IncrementalRelation

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2754:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Performance improvement for IncrementalRelation
> ---
>
> Key: HUDI-2754
> URL: https://issues.apache.org/jira/browse/HUDI-2754
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: incremental-query, performance
>Reporter: Jintao
>Assignee: Jintao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When HoodieIncrSource is used to fetch updates from another Hudi table, 
> the IncrementalRelation is used to read the data. But it has a 
> performance issue because column pruning and predicate pushdown don't 
> happen. As a result, Hudi reads a lot of unnecessary data.
> By enabling column pruning and predicate pushdown, the amount of data to read 
> is reduced dramatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-541) Replace variables/comments named "data files" to "base file"

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-541:

Fix Version/s: 0.14.1
   (was: 0.14.0)

> Replace variables/comments named "data files" to "base file"
> 
>
> Key: HUDI-541
> URL: https://issues.apache.org/jira/browse/HUDI-541
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, dev-experience
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: new-to-hudi, pull-request-available
> Fix For: 0.14.1
>
>
> Per the cWiki design and architecture page, we should converge on the same 
> terminology. We have _HoodieBaseFile_; we should ensure all variables of this 
> type are named _baseFile_ or _bf_, as opposed to _dataFile_ or _df_. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3541) Array of struct or Struct of Array AvroConversion issue

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3541:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Array of struct or Struct of Array AvroConversion issue 
> 
>
> Key: HUDI-3541
> URL: https://issues.apache.org/jira/browse/HUDI-3541
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4954) Shade avro in all bundles where it is included

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4954:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Shade avro in all bundles where it is included 
> ---
>
> Key: HUDI-4954
> URL: https://issues.apache.org/jira/browse/HUDI-4954
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> [https://github.com/apache/hudi/issues/6829]
> Shading in some but not all bundles leads to class conflicts if those bundles 
> are on the classpath.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-96) Use Command line options instead of positional arguments when launching spark applications from various CLI commands

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-96?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-96:
---
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Use Command line options instead of positional arguments when launching spark 
> applications from various CLI commands
> 
>
> Key: HUDI-96
> URL: https://issues.apache.org/jira/browse/HUDI-96
> Project: Apache Hudi
>  Issue Type: Task
>  Components: cli
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: new-to-hudi, newbie, pull-request-available, sev:normal
> Fix For: 0.14.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hoodie CLI commands like compaction/rollback/repair/savepoints/parquet-import 
> rely on launching a spark application to perform their operations (look at 
> SparkMain.java). 
> SparkMain (look at SparkMain.main()) relies on positional arguments for 
> passing the various CLI options. Instead, we should define proper CLI options 
> in SparkMain and use them (via JCommander) to improve readability and avoid 
> accidental errors at call sites. For example, see 
> com.uber.hoodie.utilities.HoodieCompactor.
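>
> A minimal sketch of named options via JCommander (illustrative option names, 
> not the final flag set):
> {code:java}
> import com.beust.jcommander.JCommander;
> import com.beust.jcommander.Parameter;
>
> // Replaces positional args[i] access with self-describing, order-free flags.
> public class SparkMainOptionsSketch {
>   @Parameter(names = {"--base-path"}, description = "Base path of the Hudi table", required = true)
>   public String basePath;
>
>   @Parameter(names = {"--instant-time"}, description = "Instant time to operate on")
>   public String instantTime;
>
>   public static void main(String[] args) {
>     SparkMainOptionsSketch opts = new SparkMainOptionsSketch();
>     JCommander.newBuilder().addObject(opts).build().parse(args);
>     // Downstream code reads opts.basePath instead of args[0], etc.
>   }
> }
> {code}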



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3676) Enhance tests for triggering clean every Nth commit

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3676:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Enhance tests for triggering clean every Nth commit 
> 
>
> Key: HUDI-3676
> URL: https://issues.apache.org/jira/browse/HUDI-3676
> Project: Apache Hudi
>  Issue Type: Test
>  Components: cleaning, tests-ci
>Reporter: sivabalan narayanan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> [PR-4385|https://github.com/apache/hudi/pull/4385]
> We need to enhance tests for this new feature, i.e., triggering clean every 
> Nth commit. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5423:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
> 
>
> Key: HUDI-5423
> URL: https://issues.apache.org/jira/browse/HUDI-5423
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.1
>
>
> {code}
> [ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
> 1,729.267 s <<< FAILURE! - in JUnit Vintage
> [ERROR] [8] 
> ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))
>   Time elapsed: 23.246 s  <<< FAILURE!
> org.opentest4j.AssertionFailedError: 
> expected: 
> <{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 999sdc","c2_minValue":" 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 984sdc","c2_minValue":" 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
> {"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 8sdc","c2_minValue":" 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
> {"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 985sdc","c2_minValue":" 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
> {"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 987sdc","c2_minValue":" 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
> {"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 989sdc","c2_minValue":" 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
> {"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 76sdc","c2_minValue":" 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
> {"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 768sdc","c2_minValue":" 
> 

[jira] [Updated] (HUDI-3617) MOR compact improve

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3617:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> MOR compact improve
> ---
>
> Key: HUDI-3617
> URL: https://issues.apache.org/jira/browse/HUDI-3617
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, writer-core
>Reporter: scx
>Assignee: scx
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> In most business scenarios, the latest data is in the latest delta log file, 
> so we sort the log files from newest to oldest according to the instant time. 
> This largely avoids rewriting data in the compaction process and thus reduces 
> the compaction time.
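>
> A minimal sketch of the proposed ordering (assuming the standard 
> HoodieLogFile accessor; not the merged patch):
> {code:java}
> import java.util.Comparator;
> import java.util.List;
> import org.apache.hudi.common.model.HoodieLogFile;
>
> // Scan delta log files from the newest instant to the oldest so the latest
> // values are seen first and older duplicates can be skipped instead of
> // rewritten during compaction.
> public class CompactionLogOrderSketch {
>   public static void sortNewestFirst(List<HoodieLogFile> logFiles) {
>     logFiles.sort(Comparator.comparing(HoodieLogFile::getBaseCommitTime).reversed());
>   }
> }
> {code}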



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6317) Streaming read should skip compaction and clustering instants to avoid duplicates

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6317:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Streaming read should skip compaction and clustering instants to avoid 
> duplicates
> -
>
> Key: HUDI-6317
> URL: https://issues.apache.org/jira/browse/HUDI-6317
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> At present, the default value of read.streaming.skip_clustering is false, 
> which means a streaming read can pick up the file slices replaced by 
> clustering. For example, while the data of day T-1 is being clustered, a 
> streaming read may read that T-1 data again and produce duplicates. 
> Therefore, streaming read should skip clustering instants in all cases to 
> avoid reading the replaced file slices. The same applies to 
> `read.streaming.skip_compaction`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1628) [Umbrella] Improve data locality during ingestion

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1628:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)
   (was: 1.1.0)

> [Umbrella] Improve data locality during ingestion
> -
>
> Key: HUDI-1628
> URL: https://issues.apache.org/jira/browse/HUDI-1628
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: satish
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.14.1
>
>
> Today the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions, and have it also consider "affinity" to an existing file group,
> based on, say, information stored in the metadata table.
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
>  for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3555) re-use spark config for parquet timestamp format instead of having our own config

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3555:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> re-use spark config for parquet timestamp format instead of having our own 
> config
> -
>
> Key: HUDI-3555
> URL: https://issues.apache.org/jira/browse/HUDI-3555
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We have two different configs to set the right timestamp format: 
> "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MICROS",
> and the spark config
> --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS 
>  
> We should deprecate our own config and just rely on Spark's config. 
>  
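> A minimal sketch of relying on the native Spark config alone, as proposed 
> above:
> {code:java}
> import org.apache.spark.sql.SparkSession;
>
> public class TimestampTypeConfigSketch {
>   public static SparkSession build() {
>     return SparkSession.builder()
>         .appName("hudi-writer")
>         // Single source of truth for the parquet timestamp output format:
>         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
>         .getOrCreate();
>   }
> }
> {code}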



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4854:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Deltastreamer does not respect partition selector regex for metadata-only 
> bootstrap
> ---
>
> Key: HUDI-4854
> URL: https://issues.apache.org/jira/browse/HUDI-4854
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6105) Partial update for MERGE INTO

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6105:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Partial update for MERGE INTO
> -
>
> Key: HUDI-6105
> URL: https://issues.apache.org/jira/browse/HUDI-6105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1864) Support for java.time.LocalDate in TimestampBasedAvroKeyGenerator

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1864:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support for java.time.LocalDate in TimestampBasedAvroKeyGenerator
> -
>
> Key: HUDI-1864
> URL: https://issues.apache.org/jira/browse/HUDI-1864
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vaibhav Sinha
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, query-eng, sev:high
> Fix For: 0.14.1
>
>
> When we read data from MySQL which has a column of type {{Date}}, Spark 
> represents it as an instance of {{java.time.LocalDate}}. If I try to use 
> this column for partitioning while doing a write to Hudi, I get the following 
> exception:
>  
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to 
> parse input partition field :2021-04-21
>   at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:136)
>  ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
>   at 
> org.apache.hudi.keygen.CustomAvroKeyGenerator.getPartitionPath(CustomAvroKeyGenerator.java:89)
>  ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
>   at 
> org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:64)
>  ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
>   at 
> org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62) 
> ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$2(HoodieSparkSqlWriter.scala:160)
>  ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.Iterator.foreach(Iterator.scala:941) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.Iterator.foreach$(Iterator.scala:941) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) 
> ~[scala-library-2.12.10.jar:?]
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) 
> ~[scala-library-2.12.10.jar:?]
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) 
> ~[scala-library-2.12.10.jar:?]
>   at 
> scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) 
> ~[scala-library-2.12.10.jar:?]
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429) 
> ~[scala-library-2.12.10.jar:?]
>   at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at org.apache.spark.scheduler.Task.run(Task.scala:131) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>  ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) 
> ~[spark-core_2.12-3.1.1.jar:3.1.1]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_171]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_171]
>   at java.lang.Thread.run(Thread.java:748) 

[jira] [Updated] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5464:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Fix instantiation of a new partition in MDT re-using the same instant time as 
> a regular commit
> --
>
> Key: HUDI-5464
> URL: https://issues.apache.org/jira/browse/HUDI-5464
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We re-use the same instant time as the commit being applied to the MDT while 
> instantiating a new partition in the MDT. This needs to be fixed. 
>  
> For example, let's say we have 10 commits with FILES already enabled. 
> For C11, we are enabling col_stats. 
> After the data table business, when we enter metadata writer instantiation, 
> we deduce that col_stats has to be instantiated, and then instantiate it 
> using DC11. In the MDT timeline, we see dc11.req, dc11.inflight and 
> dc11.complete. We then go ahead and apply the actual C11 from the DT to the 
> MDT (dc11.inflight and dc11.complete are updated). Here, we overwrite the 
> same DC11 with records pertaining to C11, which is buggy. We definitely need 
> to fix this. 
> We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction 
> and clean in the MDT, so that any additional operation in the MDT has a 
> different commit time format. Everything else should match the DT one-to-one. 
>  
>  
> Impact:
> We are overriding the same DC for two purposes, which is bad. If there is a 
> crash after initializing col_stats and before applying the actual C11 (in the 
> above context), we might mistakenly roll back the col_stats initialization, 
> while the table config could still say that col_stats is fully ready to be 
> served. But while reading the MDT, we may not read DC11 since it is a failed 
> commit. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6126) Fix test testInsertDatasetWIthTimelineTimezoneUTC to not block CI

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6126:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Fix test testInsertDatasetWIthTimelineTimezoneUTC to not block CI
> -
>
> Key: HUDI-6126
> URL: https://issues.apache.org/jira/browse/HUDI-6126
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5997) Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5997:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource
> --
>
> Key: HUDI-5997
> URL: https://issues.apache.org/jira/browse/HUDI-5997
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Sagar Sumit
>Assignee: Léo Biscassi
>Priority: Major
> Fix For: 0.14.1
>
>
> See for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2681) Make hoodie record_key and preCombine_key optional

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2681:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Make hoodie record_key and preCombine_key optional
> --
>
> Key: HUDI-2681
> URL: https://issues.apache.org/jira/browse/HUDI-2681
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core, spark-sql, writer-core
>Reporter: Vinoth Govindarajan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> At present, Hudi needs a record key and a preCombine key to create a Hudi 
> dataset, which puts a restriction on the kinds of datasets we can create 
> using Hudi.
>  
> In order to increase the adoption of the Hudi file format across all kinds of 
> derived datasets, similar to Parquet/ORC, we need to offer flexibility to 
> users. I understand that the record key is used for the upsert primitive and 
> we need the preCombine key to break ties and deduplicate, but there are event 
> data and other datasets without any primary key (append-only datasets) which 
> can benefit from Hudi, since the Hudi ecosystem offers other features such as 
> snapshot isolation, indexes, clustering, delta streamer etc., which could be 
> applied to any dataset without a record key.
>  
> The idea of this proposal is to make both the record key and preCombine key 
> optional to allow a variety of new use cases on top of Hudi.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1120:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available, sev:normal, user-support-issues
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2151) Make performant out-of-box configs

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2151:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Make performant out-of-box configs
> --
>
> Key: HUDI-2151
> URL: https://issues.apache.org/jira/browse/HUDI-2151
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, docs, writer-core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We have quite a few configs which deliver better performance or usability 
> but are guarded by flags. 
>  This is to identify them, change them, test them (functionally and for 
> performance) and make them the default.
>  
> We also need to ensure we capture all the backwards-compatibility issues that 
> can arise



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6112:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Improve Doc generation to generate config tables for basic and advanced 
> configs
> 
>
> Key: HUDI-6112
> URL: https://issues.apache.org/jira/browse/HUDI-6112
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> The HoodieConfigDocGenerator will need to be modified such that:
>  * Each config group has two sections: basic configs and advanced configs
>  * Basic configs and advanced configs are laid out in a table instead of 
> serially like today.
>  * Within each of these tables, the required configs are bubbled up to the 
> top of the table and highlighted.
> Add UI fixes to support a table layout



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6075) Improve config generation script and docs

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6075:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Improve config generation script and docs
> -
>
> Key: HUDI-6075
> URL: https://issues.apache.org/jira/browse/HUDI-6075
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: configs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6116:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Optimize log block reading by removing seeks to check corrupted blocks
> --
>
> Key: HUDI-6116
> URL: https://issues.apache.org/jira/browse/HUDI-6116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> The code currently does an eager isCorrupted check, for which we do a seek 
> and then a read; this invalidates our internal buffers in the opened file 
> stream to the log file and makes a call to the DataNode to start a new 
> blockReader.
> The seek + read becomes apparent when we do cross-datacenter reads or where 
> the latency to the file is high. In such cases, a single RPC will cost us 
> about 120 ms plus the cost of the RPC (west coast to east coast), so this 
> seek is bad for performance.
> Delaying the corruption check also gives us many benefits in low-latency 
> environments, where we see times reducing from (5 to 8 sec) to (3 s to 
> < 500 ms) for moderately sized files of 250 MB.
> NOTE: the more log blocks there are to read, the greater the performance 
> improvements.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4569) [RFC-59] Multiple event_time fields latest verification in a single table

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4569:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)
   (was: 1.1.0)

> [RFC-59] Multiple event_time fields latest verification in a single table
> -
>
> Key: HUDI-4569
> URL: https://issues.apache.org/jira/browse/HUDI-4569
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Richard Xinyao Tian
>Assignee: Richard Xinyao Tian
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.14.1
>
>
> This Jira tracks all sub-tasks related to RFC-59, which would give Hudi a 
> new feature highly demanded by finance-related industries, temporarily named 
> "Multiple event_time fields latest verification in a single table". This 
> feature would give Hudi the ability to verify multiple event_time fields as 
> the latest, and thus enable Hudi to support scenarios where complex join 
> operations are executed.
> We're very keen to make this new feature available to everyone. Since we 
> benefit from the Hudi community, we really want to give back to the 
> community with our efforts.
> For those who are interested in why this feature is desired, please see the 
> new RFC request discussion at the link below: 
> [https://lists.apache.org/thread/dlkgn1knknhl3z2gwtvchd618tj399z9]
>  
> Stages management:
>  # Briefly illustrate this idea on the dev mailing list and discuss it with 
> committers [Done]
>  # Raise a PR to claim an RFC number for further steps [Done]
>  # Write formal RFC materials to describe the design and code implementation 
> [Done] 
>  # Finish writing the RFC materials and open a PR for submitting them [Done]
>  # Wait for comments from at least two PMCs and confirm approval [Current 
> Stage]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5175) Improving FileIndex load performance in PARALLELISM mode

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5175:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Improving FileIndex load performance in PARALLELISM mode
> 
>
> Key: HUDI-5175
> URL: https://issues.apache.org/jira/browse/HUDI-5175
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3068:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add support to sync all partitions in hive sync tool
> 
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: Harshal Patil
>Priority: Blocker
>  Labels: pull-request-available, sev:critical
> Fix For: 0.14.1
>
>
> If a user runs hive sync only occasionally, archival may have kicked in and 
> trimmed some commits; if partitions were added during those commits and never 
> updated later, hive sync will miss those partitions. 
> {code:java}
>   LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", Getting commits since then");
>   return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
>       .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
> } {code}
> This is because, for recurrent syncs, we always fetch the new commits from 
> the timeline after the last synced instant, fetch their commit metadata, and 
> go on to fetch the partitions added as part of them. 
>  
> We can add a new config to the hive sync tool to override this behavior: 
> --sync-all-partitions 
> When this config is set to true, we should ignore the last synced instant 
> and go the route below, which is used when syncing for the first time (see 
> the sketch after the snippet). 
>  
> {code:java}
> if (!lastCommitTimeSynced.isPresent()) {
>   LOG.info("Last commit time synced is not known, listing all partitions in " + basePath + ",FS :" + fs);
>   HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
> } {code}
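>
> A minimal sketch of the proposed override (the syncAllPartitions flag is 
> hypothetical wiring for --sync-all-partitions; the calls are from the 
> snippets above):
> {code:java}
> if (syncAllPartitions || !lastCommitTimeSynced.isPresent()) {
>   // Ignore the last synced instant and list every partition, exactly as the
>   // first-time sync path does.
>   HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
>   return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
> }
> // Otherwise fall through to the incremental "commits since last sync" path.
> {code}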
>  
>  
> Ref issue: 
> https://github.com/apache/hudi/issues/3890



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5131) Bundle validation: upgrade/downgrade

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5131:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Bundle validation: upgrade/downgrade
> 
>
> Key: HUDI-5131
> URL: https://issues.apache.org/jira/browse/HUDI-5131
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2517) Simplify the amount of configs that need to be passed in for Delta Streamer

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2517:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Simplify the amount of configs that need to be passed in for Delta Streamer
> ---
>
> Key: HUDI-2517
> URL: https://issues.apache.org/jira/browse/HUDI-2517
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4937:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This is proving to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Integrated Hudi with Snowflake

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2832:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> [Umbrella] [RFC-40] Integrated Hudi with Snowflake 
> ---
>
> Key: HUDI-2832
> URL: https://issues.apache.org/jira/browse/HUDI-2832
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Critical
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.14.1
>
>
> Snowflake is a fully managed service that’s simple to use but can power a 
> near-unlimited number of concurrent workloads. Snowflake is a solution for 
> data warehousing, data lakes, data engineering, data science, data 
> application development, and securely sharing and consuming shared data. 
> Snowflake [doesn’t 
> support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html]
>  the Apache Hudi file format yet, but it supports the Parquet, ORC, and Delta 
> file formats. This proposal is to implement a SnowflakeSync, similar to 
> HiveSync, to sync a Hudi table as a Snowflake external Parquet table so 
> that users can query Hudi tables using Snowflake. Many users, in Hudi's and 
> other support channels, have asked to integrate Hudi with Snowflake; this 
> will unlock new use cases for Hudi.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5079) Optimize rdd.isEmpty within DeltaSync

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5079:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Optimize rdd.isEmpty within DeltaSync
> -
>
> Key: HUDI-5079
> URL: https://issues.apache.org/jira/browse/HUDI-5079
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We are calling rdd.isEmpty on the source rdd twice in DeltaSync. We should 
> optimize this and reuse the result. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2687:
-
Fix Version/s: 0.14.1
   (was: 1.0.0)
   (was: 0.14.0)
   (was: 0.15.0)

> [UMBRELLA] A new Trino connector for Hudi
> -
>
> Key: HUDI-2687
> URL: https://issues.apache.org/jira/browse/HUDI-2687
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: trino-presto
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: hudi-umbrellas
> Fix For: 0.14.1
>
> Attachments: image-2021-11-05-14-16-57-324.png, 
> image-2021-11-05-14-17-03-211.png
>
>
> This JIRA tracks all the tasks related to building a new Hudi connector in 
> Trino.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5092) Querying Hudi table throws NoSuchMethodError in Databricks runtime

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5092:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Querying Hudi table throws NoSuchMethodError in Databricks runtime 
> ---
>
> Key: HUDI-5092
> URL: https://issues.apache.org/jira/browse/HUDI-5092
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.1
>
> Attachments: image (1).png, image.png
>
>
> Originally reported by the user: 
> [https://github.com/apache/hudi/issues/6137]
>  
> The crux of the issue is that Databricks's DBR runtime diverges from OSS Spark, 
> and in this case the `FileStatusCache` API is clearly divergent between the two. 
> There are a few approaches we can take: 
>  # Avoid reliance on Spark's FileStatusCache implementation altogether and 
> rely on our own
>  # Apply a more staggered approach where we first try to use Spark's 
> FileStatusCache and, if it doesn't match the expected API, fall back to our own 
> impl
>  
> Approach # 1 would mean that we're not sharing the cache implementation 
> with Spark, which in turn entails that in some cases we might keep two 
> instances of the same cache. Approach # 2 remediates that and allows us to 
> fall back only when the API is not compatible. 
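>  
> A minimal sketch of approach # 2, probing for the expected OSS Spark API shape 
> before using it (the Hudi-side adapter and fallback class names are 
> hypothetical):
> {code:java}
> static Object createFileStatusCache(SparkSession spark) {
>   try {
>     // Scala object methods live on the companion class "FileStatusCache$".
>     Class<?> companion = Class.forName(
>         "org.apache.spark.sql.execution.datasources.FileStatusCache$");
>     companion.getMethod("getOrCreate", SparkSession.class); // API probe
>     return new SparkBackedFileStatusCache(spark);  // hypothetical adapter
>   } catch (ClassNotFoundException | NoSuchMethodException e) {
>     return new HoodieOwnedFileStatusCache();       // hypothetical own impl
>   }
> }
> {code}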



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4287) Optimize Flink checkpoint meta mechanism to fix mistaken pending instants

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4287:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
> -
>
> Key: HUDI-4287
> URL: https://issues.apache.org/jira/browse/HUDI-4287
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Shizhi Chen
>Assignee: Shizhi Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: image-2022-06-27-19-42-14-676.png, 
> image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png, 
> image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png
>
>
> *Problem review*
> CkpMetadata was introduced into the flink module to reduce the timeline burden, 
> but currently its 
> mechanism lacks a corresponding status for rollback instants, which may result 
> in commit/delta commit instant deletion, and thus 
> StreamWriteOperatorCoordinator (meta end) and the write function (data end) will 
> not be coordinated correctly.
> Finally, data files will be deleted by mistake.
> This situation is easy to reproduce, especially when 
> StreamWriteOperatorCoordinator schedules table services for a long time 
> between the commit and init instants after restoring from a checkpoint.
>  
> *Stable Reproduction Procedure*
>  * a. Before starting a job, modify 
> StreamWriteOperatorCoordinator#notifyCheckpointComplete like:
> !image-2022-06-27-19-42-14-676.png|width=479,height=293! 
> It does nothing but mock the possibly long-running table services, for fast 
> reproduction.
>  * b. Start a simple flink hudi job, such as append, and kill it 
> while the 2nd checkpoint is INFLIGHT.
>  * c. Restart it from the checkpoint restoration; it is sure to hit 
> the case after another 2 checkpoints, which may be accompanied by a 
> FileNotFoundException:
> !image-2022-06-27-20-29-49-897.png|width=503,height=386! 
> More importantly, we can observe the incoordination:
> !image-2022-06-27-20-07-55-984.png|width=517,height=109! 
> The screenshot above shows that the instant should be 20220531163135119 in 
> 2022-05-31 16:36, which is committed by StreamWriteOperatorCoordinator as the 
> meta end.
> !image-2022-06-27-20-11-47-939.png|width=517,height=155! 
> At the same time, the data files are written with the wrong base commit 
> instant, 20220531161923191, which was deleted during the rollbacks in procedure 
> c. because it was incomplete and should also have been evicted from ckp_meta.
>  
> *Solution*
> The solution is to optimize the mechanism with a CANCELLED CkpMessage state at 
> the highest priority, corresponding to the DELETE instant during the rollback 
> action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3957) Evaluate Support for spark2 and scala12

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3957:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Evaluate Support for spark2 and scala12 
> 
>
> Key: HUDI-3957
> URL: https://issues.apache.org/jira/browse/HUDI-3957
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.14.1
>
> Attachments: Screen Shot 2022-05-05 at 8.51.11 AM.png, Screen Shot 
> 2022-05-05 at 8.53.39 AM.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We may need to evaluate the need for supporting spark2 and scala 12 and 
> deprecate them if there is not much usage. 
>  
> From the overall stats, hudi-spark_2.12 bundle usage is 2%. Among 
> hudi_spark2.12 bundle usages, most are from 0.7, 0.8, and 0.9; 0.10 
> and above account for ~5% of all usages of the hudi-spark2.12 bundle. So we 
> can probably deprecate spark2 and scala12 going forward and ask users to 
> use spark3. 
>  
> !Screen Shot 2022-05-05 at 8.51.11 AM.png!
>  
>  
> !Screen Shot 2022-05-05 at 8.53.39 AM.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4105) Identify out of the box performance config flips for spark-ds

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4105:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Identify out of the box performance config flips for spark-ds
> -
>
> Key: HUDI-4105
> URL: https://issues.apache.org/jira/browse/HUDI-4105
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.1
>
>
> We need to identify out-of-the-box performance flips. Refer to HUDI-2151 for 
> the older ticket. But we need to comb through all configs once again and come up 
> with an updated list. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4449) Spark: Support DataSourceV2 Read

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4449:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Spark: Support DataSourceV2  Read
> -
>
> Key: HUDI-4449
> URL: https://issues.apache.org/jira/browse/HUDI-4449
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, spark, spark-sql
>Reporter: chenliang
>Assignee: chenliang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Introduce the v2 reading interface and define {{HoodieBatchScanBuilder}} to 
> provide querying capability. Column pruning & push-down will follow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-956) Test MOR : Presto Realtime Query with metadata bootstrap

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-956:

Fix Version/s: 0.14.1
   (was: 0.14.0)

> Test MOR : Presto Realtime Query with metadata bootstrap
> 
>
> Key: HUDI-956
> URL: https://issues.apache.org/jira/browse/HUDI-956
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino-presto
>Reporter: Balaji Varadarajan
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1258) Small file handling Merges can be handled without actual merging

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1258:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Small file handling Merges can be handled without actual merging
> 
>
> Key: HUDI-1258
> URL: https://issues.apache.org/jira/browse/HUDI-1258
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.14.1
>
>
> If a file slice gets inserts routed into MergeHandle purely for file-sizing 
> reasons, there is no reason to actually build the hashmap and merge. 
>  
> This will also avoid the issue of an insert with a duplicate key 
> overwriting the previous value. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6282) Config conflict with Deltastreamer CustomKeyGenerator for PartitionPath

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6282:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Config conflict with Deltastreamer CustomKeyGenerator for PartitionPath
> ---
>
> Key: HUDI-6282
> URL: https://issues.apache.org/jira/browse/HUDI-6282
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> With the debezium source, while using CustomKeyGenerator, the `:Timestamp` type 
> suffix must be passed on the partition path field for the first run, but 
> subsequent runs work without it.
>  
> Github Issue - [https://github.com/apache/hudi/issues/8372]
>  
> Details to Reproduce the issue- 
> [https://gist.github.com/ad1happy2go/49b81f015c1a2964fee489214658cf44]
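>  
> A hedged illustration of the inconsistency (the field name is hypothetical):
> {code}
> # First run: CustomKeyGenerator requires the type suffix on the partition field
> hoodie.datasource.write.partitionpath.field=created_at:TIMESTAMP
> # Subsequent runs reportedly succeed with the bare field name, which then
> # conflicts with the value stored in the table config:
> hoodie.datasource.write.partitionpath.field=created_at
> {code}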



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4967:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From 
> this release, if this config is not set and Hive sync is enabled, then 
> partition value extractor class will be *automatically inferred* on the basis 
> of number of partition fields and whether or not hive style partitioning is 
> enabled.
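>  
> For illustration, a minimal explicit setting matching the pre-0.12 default 
> (the partition field value here is an assumption for a day-partitioned table, 
> not universal):
> {code}
> hoodie.datasource.hive_sync.partition_fields=ts
> hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
> {code}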



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-992:

Fix Version/s: 0.14.1
   (was: 0.14.0)

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap, meta-sync
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.1
>
>
> Currently the bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not include the partition column schema (in the case of 
> hive-partitioned data). As a result, during hive sync, when hudi tries to 
> determine the type of a partition column from that schema, it does not find it 
> and assumes the default data type, *string*.
> Here is where the partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus, no matter what the data type of the partition column is in the source 
> data (at least as spark infers it from the path), it will always be synced as 
> string.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6136) Add idempotency support to spark datasource writes

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6136:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add idempotency support to spark datasource writes
> --
>
> Key: HUDI-6136
> URL: https://issues.apache.org/jira/browse/HUDI-6136
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> It would be good to add idempotency support to spark data-source writes. 
> Essentially, if the same batch is ingested again into hudi, hudi should skip 
> it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1280) Add tool to capture earliest or latest offsets in kafka topics

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1280:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add tool to capture earliest or latest offsets in kafka topics 
> ---
>
> Key: HUDI-1280
> URL: https://issues.apache.org/jira/browse/HUDI-1280
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.14.1
>
>
> For bootstrapping cases using spark.write(), we need to capture the offsets 
> from the kafka topic and use them as the checkpoint for subsequent reads from 
> the Kafka topic.
>  
> [https://github.com/apache/hudi/issues/1985]
> We need to build this integration for a smooth transition to deltastreamer.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2516) Upgrade to Junit 5.8.2

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2516:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Upgrade to Junit 5.8.2
> --
>
> Key: HUDI-2516
> URL: https://issues.apache.org/jira/browse/HUDI-2516
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing, tests-ci
>Reporter: Raymond Xu
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5018) Make user-provided copyOnWriteRecordSizeEstimate first precedence

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5018:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Make user-provided copyOnWriteRecordSizeEstimate first precedence
> -
>
> Key: HUDI-5018
> URL: https://issues.apache.org/jira/browse/HUDI-5018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> For estimated avg record size
> https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate
> which is used here
> https://github.com/apache/hudi/blob/86a1efbff1300603a8180111eae117c7f9dbd8a5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L372
> Propose to respect the user setting by following the precedence below:
> 1) if the user sets a value, use it as is 
> 2) if the user has not set it, infer from the timeline commit metadata 
> 3) if the timeline is empty, use a default (current: 1024)
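>  
> A minimal sketch of that precedence (the helper names are assumed, not the 
> actual UpsertPartitioner internals):
> {code:java}
> // Assumed imports: org.apache.hudi.config.HoodieWriteConfig,
> // org.apache.hudi.common.table.timeline.HoodieTimeline,
> // org.apache.hudi.common.util.Option.
> long averageRecordSize(HoodieWriteConfig config, HoodieTimeline timeline) {
>   // 1) an explicitly set user value wins as-is
>   if (config.isRecordSizeEstimateExplicitlySet()) {          // hypothetical check
>     return config.getCopyOnWriteRecordSizeEstimate();
>   }
>   // 2) otherwise infer from commit metadata on the timeline
>   Option<Long> inferred = inferFromCommitMetadata(timeline); // hypothetical helper
>   if (inferred.isPresent()) {
>     return inferred.get();
>   }
>   // 3) empty timeline: fall back to the default (currently 1024)
>   return 1024L;
> }
> {code}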



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1549) Programmatic way to fetch earliest commit retained

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1549:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Programmatic way to fetch earliest commit retained 
> ---
>
> Key: HUDI-1549
> URL: https://issues.apache.org/jira/browse/HUDI-1549
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cleaning, timeline-server
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: query-eng, sev:normal, user-support-issues
> Fix For: 0.14.1
>
>
> For GDPR deletions, it would be nice if customers could programmatically find 
> out what the earliest commit retained is. 
> More context: https://github.com/apache/hudi/issues/2135



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1936) Introduce an optional property for conditional upsert

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1936:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Introduce an optional property for conditional upsert 
> -
>
> Key: HUDI-1936
> URL: https://issues.apache.org/jira/browse/HUDI-1936
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core, writer-core
>Reporter: Biswajit mohapatra
>Assignee: Biswajit mohapatra
>Priority: Major
>  Labels: features, pull-request-available, sev:high
> Fix For: 0.14.1
>
>
> If anyone wants to use custom upsert logic, they have to override the 
> latest avro payload class, which is only possible in java or scala. 
> Python developers have no such option. 
> We will introduce a new payload class and a new key which can work in java, 
> scala, and python. 
> The class will be responsible for the custom upsert logic, and the new key, 
> hoodie.update.keys, will accept the columns that need to be updated: 
>  
> "hoodie.update.keys": "admission_date,name",  #comma separated keys 
> "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert" 
> #custom payload class 
>  
> So this will only update the columns admission_date and name in the target 
> table. 
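>  
> A minimal sketch of such a payload class, assuming the 
> HoodieRecordPayload#combineAndGetUpdateValue(IndexedRecord, Schema, Properties) 
> extension point (the class itself and the hoodie.update.keys config are the 
> proposal, not an existing API):
> {code:java}
> // Assumed imports: org.apache.avro.Schema, org.apache.avro.generic.*,
> // org.apache.hudi.common.model.OverwriteWithLatestAvroPayload,
> // org.apache.hudi.common.util.Option, java.io.IOException, java.util.Properties.
> public class HudiCustomUpsert extends OverwriteWithLatestAvroPayload {
> 
>   public HudiCustomUpsert(GenericRecord record, Comparable orderingVal) {
>     super(record, orderingVal);
>   }
> 
>   @Override
>   public Option<IndexedRecord> combineAndGetUpdateValue(
>       IndexedRecord currentValue, Schema schema, Properties props) throws IOException {
>     Option<IndexedRecord> incoming = getInsertValue(schema);
>     if (!incoming.isPresent()) {
>       return Option.of(currentValue);
>     }
>     // Start from the stored record and overwrite only the configured columns.
>     GenericRecord merged =
>         new GenericData.Record((GenericData.Record) currentValue, true);
>     for (String field : props.getProperty("hoodie.update.keys", "").split(",")) {
>       String col = field.trim();
>       if (!col.isEmpty()) {
>         merged.put(col, ((GenericRecord) incoming.get()).get(col));
>       }
>     }
>     return Option.of(merged);
>   }
> }
> {code}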
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3582) Introduce Secondary Index to Improve HUDI Query Performance

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3582:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Introduce Secondary Index to Improve HUDI Query Performance
> ---
>
> Key: HUDI-3582
> URL: https://issues.apache.org/jira/browse/HUDI-3582
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: shibei
>Assignee: shibei
>Priority: Blocker
> Fix For: 0.14.1
>
>
> In query processing, we need to scan many data blocks in a HUDI table. However, 
> most of them may not match the query predicate, even after using the column 
> statistics info in the metadata table, row-group-level or page-level 
> statistics in parquet files, etc.
> The total data size of the touched blocks determines the query speed, and 
> saving IO has become the key to improving query performance. To address 
> this problem, we introduce a secondary index to improve hudi query performance.
> In the initial implementation, a secondary index based on lucene will be 
> built.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3560) Add docker image for spark3 hadoop3 and hive3

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3560:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add docker image for spark3 hadoop3 and hive3
> -
>
> Key: HUDI-3560
> URL: https://issues.apache.org/jira/browse/HUDI-3560
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1588) Support multiple ordering fields via payload class config

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1588:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support multiple ordering fields via payload class config
> -
>
> Key: HUDI-1588
> URL: https://issues.apache.org/jira/browse/HUDI-1588
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Raymond Xu
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.14.1
>
>
> To make configuration simpler, we want to deprecate --source-ordering-field 
> config and combine it with payload class config so that users can plug in 
> custom payload class that handles record ordering. Also the logic can be 
> extended to look up for multiple fields for ordering.
>  
> Discussion thread
> https://lists.apache.org/thread.html/r884e47f21ce1f3f09ef12722fa05ba9900ba12429935c10167b9fce6%40%3Cdev.hudi.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6122:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Call clean/compaction support custom options
> 
>
> Key: HUDI-6122
> URL: https://issues.apache.org/jira/browse/HUDI-6122
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5951:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4626:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Partitioning table by `_hoodie_partition_path` fails
> 
>
> Key: HUDI-4626
> URL: https://issues.apache.org/jira/browse/HUDI-4626
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.1
>
>
>  
> Currently, creating a table partitioned by "_hoodie_partition_path" using the 
> Glue catalog fails w/ the following exception:
> {code:java}
> AnalysisException: Found duplicate column(s) in the data schema and the 
> partition schema: _hoodie_partition_path
> {code}
> Using the following DDL:
> {code:java}
> CREATE EXTERNAL TABLE `active_storage_attachments`(
>   `_hoodie_commit_time` string COMMENT '',
>   `_hoodie_commit_seqno` string COMMENT '',
>   `_hoodie_record_key` string COMMENT '',
>   `_hoodie_file_name` string COMMENT '',
>   `_change_operation_type` string COMMENT '',
>   `_upstream_event_processed_ts_ms` bigint COMMENT '',
>   `db_shard_source_partition` string COMMENT '',
>   `_event_origin_ts_ms` bigint COMMENT '',
>   `_event_tx_id` bigint COMMENT '',
>   `_event_lsn` bigint COMMENT '',
>   `_event_xmin` bigint COMMENT '',
>   `id` bigint COMMENT '',
>   `name` string COMMENT '',
>   `record_type` string COMMENT '',
>   `record_id` bigint COMMENT '',
>   `blob_id` bigint COMMENT '',
>   `created_at` timestamp COMMENT '')
> PARTITIONED BY (
>   `_hoodie_partition_path` string COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'hoodie.query.as.ro.table'='false',
>   'path'='...')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   '...'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='hudi')
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4674) change the default value of inputFormat for the MOR table

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4674:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> change the default value of inputFormat for the MOR table
> -
>
> Key: HUDI-4674
> URL: https://issues.apache.org/jira/browse/HUDI-4674
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: linfey.nie
>Assignee: linfey.nie
>Priority: Major
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.14.1
>
>
> When we build a mor table, for example with Sparksql, the default value of 
> inputFormat is HoodieParquetRealtimeInputFormat. But when hive sync is used 
> and the _ro suffix is skipped for reads, the inputFormat of the original 
> table name should be HoodieParquetInputFormat, which it currently is not. I 
> think we should change the default value of inputFormat, just like the cow table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2740) Support for snapshot querying on MOR table

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2740:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support for snapshot querying on MOR table
> --
>
> Key: HUDI-2740
> URL: https://issues.apache.org/jira/browse/HUDI-2740
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4324) Remove useJdbc config from meta sync tools

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4324:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Remove useJdbc config from meta sync tools
> --
>
> Key: HUDI-4324
> URL: https://issues.apache.org/jira/browse/HUDI-4324
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1872) Move HoodieFlinkStreamer into hudi-utilities module

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1872:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Move HoodieFlinkStreamer into hudi-utilities module
> ---
>
> Key: HUDI-1872
> URL: https://issues.apache.org/jira/browse/HUDI-1872
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Assignee: Vinay
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4613) Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4613:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Avoid the use of regex expressions when call hoodieFileGroup#addLogFile 
> function
> 
>
> Key: HUDI-4613
> URL: https://issues.apache.org/jira/browse/HUDI-4613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: lei w
>Assignee: lei w
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: withChange.png, withoutChange.png
>
>
> When the number of log files exceeds a certain threshold, the 
> construction of the fsview becomes very time-consuming. The reason is that 
> the LogFileComparator#compare method is called frequently when constructing a 
> filegroup, and regular expressions are used in that method.
> {panel:title=build FileSystemView Log }
>  INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=60801, 
> NumFileGroups=200, FileGroupsCreationTime=34036, StoreTimeTaken=2
> {panel}
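>  
> A minimal sketch of the idea: parse the file name once per HoodieLogFile and 
> compare the cached fields, so that compare() itself never touches a regex 
> (memoizing the parsed fields on the HoodieLogFile instance is the assumption 
> here):
> {code:java}
> private static final Comparator<HoodieLogFile> LOG_FILE_COMPARATOR =
>     Comparator.comparing(HoodieLogFile::getBaseCommitTime) // pre-parsed once
>               .thenComparing(HoodieLogFile::getLogVersion);
> {code}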



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5575:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support any record key generation along w/ any partition path generation for 
> row writer
> ---
>
> Key: HUDI-5575
> URL: https://issues.apache.org/jira/browse/HUDI-5575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> HUDI-5535 adds support for record key generation along w/ any partition path 
> generation. It also separates the record key generation and partition path 
> generation into separate interfaces.
> This jira aims to add similar support for the row writer path in spark.
> cc [~shivnarayan] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2034) Support explicit partition compaction strategy for flink batch compaction

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2034:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Support explicit partition compaction strategy for flink batch compaction 
> --
>
> Key: HUDI-2034
> URL: https://issues.apache.org/jira/browse/HUDI-2034
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, flink
>Reporter: Zheng yunhong
>Assignee: Zheng yunhong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Support explicit partition compaction strategy for flink batch compaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4990) Parallelize deduplication in CLI tool

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4990:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Parallelize deduplication in CLI tool
> -
>
> Key: HUDI-4990
> URL: https://issues.apache.org/jira/browse/HUDI-4990
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.14.1
>
>
> The CLI tool command `repair deduplicate` repairs one partition at a time. To 
> repair hundreds of partitions, this takes time. We should add a mode to take 
> multiple partition paths in the CLI and run the dedup job for multiple 
> partitions at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4106) Identify out of the box default performance flips for spark-sql

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4106:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Identify out of the box default performance flips for spark-sql
> ---
>
> Key: HUDI-4106
> URL: https://issues.apache.org/jira/browse/HUDI-4106
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.14.1
>
>
> We had HUDI-2151 to track performance flips, but it has been a year since we 
> combed through all configs. Let's do another round of combing through all 
> configs and come up with a new list to flip. 
> This ticket specifically tracks spark-sql layer configs. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3954) Don't keep the last commit before the earliest commit to retain

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3954:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Don't keep the last commit before the earliest commit to retain
> ---
>
> Key: HUDI-3954
> URL: https://issues.apache.org/jira/browse/HUDI-3954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: 董可伦
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Don't keep the last commit before the earliest commit to retain.
> According to the documentation of {{{}hoodie.cleaner.commits.retained{}}}:
> Number of commits to retain, without cleaning. This will be retained for 
> num_of_commits * time_between_commits (scheduled). This also directly 
> translates into how much data retention the table supports for incremental 
> queries.
>  
> We only need to keep the number of commits configured through the 
> {{{}hoodie.cleaner.commits.retained{}}} parameter,
> and the commits retained by clean are completed. This ensures the “This will be 
> retained for num_of_commits * time_between_commits” guarantee in the document.
> So we don't need to keep the last commit before the earliest commit to 
> retain. If we want to keep more versions, we can increase the 
> {{hoodie.cleaner.commits.retained}} parameter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6464) Implement Spark SQL Merge Into for tables without primary key

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6464:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Implement Spark SQL Merge Into for tables without primary key
> -
>
> Key: HUDI-6464
> URL: https://issues.apache.org/jira/browse/HUDI-6464
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Merge Into currently only matches on the primary key, which pkless tables 
> don't have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5101) Adding spark structured streaming tests to integ tests

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5101:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Adding spark structured streaming tests to integ tests
> --
>
> Key: HUDI-5101
> URL: https://issues.apache.org/jira/browse/HUDI-5101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6552) Restructure FAQ on the website to categorize tuning and troubleshooting related questions separately.

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6552:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Restructure FAQ on the website to categorize tuning and troubleshooting 
> related questions separately. 
> --
>
> Key: HUDI-6552
> URL: https://issues.apache.org/jira/browse/HUDI-6552
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3291) Flip Default record paylod to DefaultHoodieRecordPayload

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3291:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Flip Default record paylod to DefaultHoodieRecordPayload
> 
>
> Key: HUDI-3291
> URL: https://issues.apache.org/jira/browse/HUDI-3291
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-304) Bring back spotless plugin

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-304:

Fix Version/s: 0.14.1
   (was: 0.14.0)

> Bring back spotless plugin 
> ---
>
> Key: HUDI-304
> URL: https://issues.apache.org/jira/browse/HUDI-304
> Project: Apache Hudi
>  Issue Type: Task
>  Components: code-quality, dev-experience, Testing
>Reporter: Balaji Varadarajan
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The spotless plugin has been turned off, as the eclipse style format it was 
> referencing was removed for compliance reasons. 
> We use the google-style eclipse format with a few changes; for example, the 
> formatter value at line 242 of the style sheet changes from "100" to "120".
>  
> The eclipse style sheet was originally obtained from 
> [https://github.com/google/styleguide], whose CC-BY 3.0 license is not 
> compatible with source distribution (see 
> [https://www.apache.org/legal/resolved.html#cc-by]). 
>  
> We need to figure out a way to bring this back.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4345) Incorporate partition pruning for COW incremental query

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4345:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Incorporate partition pruning for COW incremental query
> ---
>
> Key: HUDI-4345
> URL: https://issues.apache.org/jira/browse/HUDI-4345
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2296) flink support ConsistencyGuard plugin

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2296:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> flink support  ConsistencyGuard plugin 
> ---
>
> Key: HUDI-2296
> URL: https://issues.apache.org/jira/browse/HUDI-2296
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Gengxuhan
>Assignee: Gengxuhan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> flink should support a ConsistencyGuard plugin. When there is metadata latency 
> in the file system, a CKP is submitted once and then a new instant is started; 
> because the commit cannot be seen, the data will be rolled back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4878:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
> 
>
> Key: HUDI-4878
> URL: https://issues.apache.org/jira/browse/HUDI-4878
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: nicolas paris
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Clean based on LATEST_FILE_VERSIONS can be improved further, since incremental 
> clean is not enabled; let's see if we can improve it. 
>  
> Context from the author: 
>  
> Currently incremental cleaning is run for both the KEEP_LATEST_COMMITS and 
> KEEP_LATEST_BY_HOURS
> policies. It is not run with KEEP_LATEST_FILE_VERSIONS.
> This can lead to files not being cleaned. This PR fixes the problem by enabling 
> incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
> Here is the scenario of the problem:
> Say we have 3 committed files in partition-A, we add a new commit in 
> partition-B, and we trigger cleaning for the first time (full partition scan):
>  {{partition-A/
> commit-0.parquet
> commit-1.parquet
> commit-2.parquet
> partition-B/
> commit-3.parquet}}
> In the case where we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, 
> the cleaner will remove commit-0.parquet to keep 3 commits.
> For the next cleaning, incremental cleaning will trigger and won't consider 
> partition-A/ until a new commit changes it. In case no later commit changes 
> partition-A, then commit-1.parquet will stay forever. However, it should be 
> removed by the cleaner.
> Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep 
> commit-2.parquet. Then it makes sense that incremental cleaning won't 
> consider partition-A until it is changed, because there is only one commit.
> This is why incremental cleaning should only be enabled with 
> KEEP_LATEST_FILE_VERSIONS.
> Hope this is clear enough.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4688) Decouple lazy rollback of failed writes from clean action in multi-writer

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4688:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Decouple lazy rollback of failed writes from clean action in multi-writer
> -
>
> Key: HUDI-4688
> URL: https://issues.apache.org/jira/browse/HUDI-4688
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.1
>
>
> What happens if someone disables cleaning, or runs it only once every 50 
> commits? The lazy rollback won't happen for up to 50 commits. So, decouple lazy 
> rollbacks of failed writes from the cleaner in multi-writer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6115) Harden expected corrupt record column in chained transformer when error table settings are on/off

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6115:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Harden expected corrupt record column in chained transformer when error table 
> settings are on/off 
> --
>
> Key: HUDI-6115
> URL: https://issues.apache.org/jira/browse/HUDI-6115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When the error table is enabled and a transformer drops an existing 
> corruptRecordColumn, that can lead to quarantined records getting dropped. 
> This PR aims at hardening the expectation of the corruptRecordColumn in the 
> output schemas of transformers. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6149) Add a tool to fetch table size for hudi tables

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6149:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add a tool to fetch table size for hudi tables
> --
>
> Key: HUDI-6149
> URL: https://issues.apache.org/jira/browse/HUDI-6149
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3819) upgrade spring cve-2022-22965

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3819:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> upgrade spring cve-2022-22965
> -
>
> Key: HUDI-3819
> URL: https://issues.apache.org/jira/browse/HUDI-3819
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.9.0, 0.10.1
>Reporter: Jason-Morries Adam
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We should upgrade the Spring Framework version in the Hudi CLI because of 
> CVE-2022-22965. The Qualys scanner finds these packages and raises a warning 
> because these files exist on the system. 
> The found files are:
> /usr/lib/hudi/cli/lib/spring-beans-4.2.4.RELEASE.jar 
> /usr/lib/hudi/cli/lib/spring-core-4.2.4.RELEASE.jar
> More Information: 
> Spring Framework: https://spring.io/projects/spring-framework
> Spring project spring-framework release notes: 
> https://github.com/spring-projects/spring-framework/releases
> CVE-2022-22965: https://tanzu.vmware.com/security/cve-2022-22965



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3742) Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3742:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Enable parquet enableVectorizedReader for spark incremental read to prevent 
> perf regression
> ---
>
> Key: HUDI-3742
> URL: https://issues.apache.org/jira/browse/HUDI-3742
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Tao Meng
>Assignee: Tao Meng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Currently we disable the parquet enableVectorizedReader for mor incremental 
> reads
> and set "spark.sql.parquet.recordLevelFilter.enabled" = "true" to achieve 
> data filtering,
> which is slow.
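>  
> The proposed flip, expressed as Spark session settings (illustrative of the 
> direction; the exact wiring inside Hudi's relation code may differ):
> {code}
> spark.sql.parquet.enableVectorizedReader=true
> spark.sql.parquet.recordLevelFilter.enabled=false
> {code}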



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5498:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Update docs for reading Hudi tables on Databricks runtime
> -
>
> Key: HUDI-5498
> URL: https://issues.apache.org/jira/browse/HUDI-5498
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.1
>
>
> We need to document how users can read Hudi tables on Databricks Spark 
> runtime. 
> Relevant fix: [https://github.com/apache/hudi/pull/7088]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2955:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Upgrade Hadoop to 3.3.x
> ---
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
>
> According to the Hadoop compatibility matrix, this is a prerequisite for 
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>  
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6281) Comprehensive schema evolution supports column change with a default value

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6281:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Comprehensive schema evolution supports column change with a default value
> --
>
> Key: HUDI-6281
> URL: https://issues.apache.org/jira/browse/HUDI-6281
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Comprehensive schema evolution should support column changes with a default 
> value, e.g. adding a new column with a default value.
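> A hedged illustration of the target schema shape using the Avro SchemaBuilder 
> (the record and field names are made up for this sketch):
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> 
> Schema evolved = SchemaBuilder.record("user").fields()
>     .requiredLong("id")
>     // New column carrying a default, so existing records read back "unknown":
>     .name("city").type().stringType().stringDefault("unknown")
>     .endRecord();
> {code}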



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5075) Add support to rollback residual clustering after disabling clustering

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5075:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add support to rollback residual clustering after disabling clustering
> --
>
> Key: HUDI-5075
> URL: https://issues.apache.org/jira/browse/HUDI-5075
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> If a user enabled clustering and later disabled it for whatever reason, there 
> is a chance that a pending clustering is left in the timeline. Once clustering 
> is disabled, this instant could just be lying around, but it could affect 
> metadata table compaction, which in turn might affect data table archival. 
> So, we need a way to fix this. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-6121:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Right now the Kafka callback does not log the exception, so it is hard to 
> find out why sending the Kafka message failed.
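> A minimal sketch of the proposed fix, assuming an slf4j logger; the producer 
> wiring and class name are illustrative, not Hudi's actual callback internals:
> {code:java}
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> 
> public class CommitCallbackSketch {
>   private static final Logger LOG = LoggerFactory.getLogger(CommitCallbackSketch.class);
> 
>   void send(KafkaProducer<String, String> producer, String topic, String payload) {
>     producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
>       if (exception != null) {
>         // Previously the failure was silent; log it so operators can see why
>         // the callback message was not delivered.
>         LOG.error("Failed to send Hudi commit callback message to Kafka", exception);
>       }
>     });
>   }
> }
> {code}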



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3636:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Scenario: multi-writer test, one writer doing ingestion with Deltastreamer in 
> continuous mode, COW, inserts, async clustering and cleaning (partitions 
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills 
> to different partitions (2021/12).  
> 0.10.0 no MT, clustering instant is inflight (failing it in the middle before 
> upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  Need to investigate if this is 
> timeline-server-based marker related or MT related.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)

[jira] [Updated] (HUDI-4329) Add separate control for compaction operation sync/async mode

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-4329:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add separate control for compaction operation sync/async mode
> -
>
> Key: HUDI-4329
> URL: https://issues.apache.org/jira/browse/HUDI-4329
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Shizhi Chen
>Assignee: Shizhi Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> *Problem Review*
> The compact operation sync/async in CompactionFunction is now controlled by 
> FlinkOptions#COMPACTION_ASYNC_ENABLED
> {code:java}
>   public CompactFunction(Configuration conf) {
> this.conf = conf;
> this.asyncCompaction = StreamerUtil.needsAsyncCompaction(conf);
>   }
> {code}
> In fact, it cannot be switched to sync mode, because the pipeline built for 
> sync compaction only includes the clean operators, not the compact operators.
> {code:java}
>   // compaction
>   if (StreamerUtil.needsAsyncCompaction(conf)) {
> return Pipelines.compact(conf, pipeline);
>   } else {
> return Pipelines.clean(conf, pipeline);
>   }
> {code}
> *Improvement*
> Add another separate control switch for compaction operation sync/async mode.
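> A hedged sketch of what the separate switch could look like; the option key 
> "compaction.sync.enabled" is an assumption, not an actual FlinkOptions entry:
> {code:java}
> boolean asyncCompaction = conf.getBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED);
> boolean syncCompaction = conf.getBoolean("compaction.sync.enabled", false); // hypothetical key
> if (asyncCompaction || syncCompaction) {
>   // Wire the compact operators into the pipeline in either mode; only the
>   // execution style (blocking vs background) would differ downstream.
>   return Pipelines.compact(conf, pipeline);
> } else {
>   return Pipelines.clean(conf, pipeline);
> }
> {code}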



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2733) Adding Thrift support in HiveSyncTool

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-2733:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Adding Thrift support in HiveSyncTool
> -
>
> Key: HUDI-2733
> URL: https://issues.apache.org/jira/browse/HUDI-2733
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync, Utilities
>Reporter: Satyam Raj
>Assignee: Satyam Raj
>Priority: Major
>  Labels: hive-sync, pull-request-available
> Fix For: 0.14.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Introduction of Thrift Metastore client to sync Hudi data in Hive warehouse.
> Suggested client to integrate with:
> https://github.com/akolb1/hclient/tree/master/tools-common



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5352:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests 
> because Jackson is not able to serialize LocalDate as is, requiring the 
> additional JSR310 dependency.
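> A minimal sketch of the JSR-310 fix: register the jackson-datatype-jsr310 
> module so Jackson can handle java.time types (where Hudi would wire the 
> mapper in is not shown here):
> {code:java}
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
> 
> ObjectMapper mapper = new ObjectMapper().registerModule(new JavaTimeModule());
> // Without the module this throws InvalidDefinitionException for LocalDate.
> String json = mapper.writeValueAsString(java.time.LocalDate.now());
> {code}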
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-5238:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it's failing is the following:
> # GCS uses PipedInputStream/PipedOutputStream comprising the reading/writing 
> ends of the “pipe” it uses for unidirectional communication between threads
> # PipedInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
> # In BoundedInMemoryQueue we're bootstrapping new executors (read, threads) 
> for reading and _writing_ (it's only used in HoodieMergeHandle and in 
> bulk-insert)
> # When we're done writing in HoodieMergeHelper, we're shutting down *first* 
> the BIMQ, then the HoodieMergeHandle, and that's exactly the reason why it's 
> failing
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]
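> A stand-alone reproduction of the java.io behavior behind steps 2-4, using 
> plain PipedInputStream/PipedOutputStream rather than Hudi or GCS internals:
> {code:java}
> import java.io.IOException;
> import java.io.PipedInputStream;
> import java.io.PipedOutputStream;
> 
> public class PipeBrokenSketch {
>   public static void main(String[] args) throws Exception {
>     PipedOutputStream out = new PipedOutputStream();
>     PipedInputStream in = new PipedInputStream(out);
> 
>     Thread writer = new Thread(() -> {
>       try { out.write(42); } catch (IOException ignored) { }
>     });
>     writer.start();
>     writer.join(); // the writer thread dies without closing, like a shut-down executor
> 
>     System.out.println(in.read()); // the buffered byte still reads back: 42
>     in.read(); // throws IOException ("Write end dead" / "Pipe broken"), as in the report
>   }
> }
> {code}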



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3695) Add orc reader in HoodieBaseRelation

2023-10-04 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-3695:
-
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Add orc reader in HoodieBaseRelation
> 
>
> Key: HUDI-3695
> URL: https://issues.apache.org/jira/browse/HUDI-3695
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.11.0
>Reporter: miomiocat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> The ORC reader is not supported in HoodieBaseRelation after HUDI-3338.
> It should be restored so that ORC-format-based Hudi tables can be read; 
> otherwise an UnsupportedOperationException is thrown.
>  
> code details:
> [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala#L323]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   6   7   8   9   >