[jira] [Closed] (HUDI-6153) Change the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks
[ https://issues.apache.org/jira/browse/HUDI-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason closed HUDI-6153.
Fix Version/s: 0.14.0 (was: 0.14.1)
Resolution: Fixed

Key: HUDI-6153
URL: https://issues.apache.org/jira/browse/HUDI-6153
Project: Apache Hudi
Issue Type: Improvement
Components: metadata
Reporter: Prashant Wason
Assignee: Prashant Wason
Priority: Major
Labels: pull-request-available
Fix For: 0.14.0

When rolling back completed commits for indexes like the record index, the list of all keys removed from the dataset is required. This information is not available during rollback processing in the MDT, since the files have already been deleted while the rollback was inflight. Hence, the current MDT rollback mechanism of appending -files and -col_stats revert entries does not work for the record index.

This PR changes the rollback mechanism to actually roll back deltacommits on the MDT. This makes rollback handling faster and keeps the MDT in sync with the dataset.
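A minimal conceptual sketch of the two strategies described above, with hypothetical names (not the actual Hudi metadata-writer API):

{code:java}
// Hypothetical interfaces for illustration only.
interface MetadataTable {
  void appendRevertBlocks(String instantTime);  // old path: needs the removed keys
  void rollbackDeltaCommit(String instantTime); // new path: a true rollback
}

class MdtRollbackSketch {
  static void onDataTableRollback(MetadataTable mdt, String failedInstant) {
    // The MDT deltacommit reuses the data-table instant time, so rolling it
    // back directly keeps the two timelines in sync without having to
    // re-derive the deleted keys (which is impossible for the record index).
    mdt.rollbackDeltaCommit(failedInstant);
  }
}
{code}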
[jira] [Updated] (HUDI-1887) Make schema post processor's default as disabled
[ https://issues.apache.org/jira/browse/HUDI-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1887:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-1887
URL: https://issues.apache.org/jira/browse/HUDI-1887
Project: Apache Hudi
Issue Type: Improvement
Components: spark
Reporter: sivabalan narayanan
Assignee: sivabalan narayanan
Priority: Major
Labels: core-flow-ds, pull-request-available, sev:high, triaged
Fix For: 0.14.1

With the default value [fix|https://github.com/apache/hudi/pull/2765], the schema post processor no longer needs to be mandatory.
[jira] [Updated] (HUDI-5552) Too slow while using trino-hudi connector while querying partitioned tables.
[ https://issues.apache.org/jira/browse/HUDI-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5552:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5552
URL: https://issues.apache.org/jira/browse/HUDI-5552
Project: Apache Hudi
Issue Type: Bug
Components: trino-presto
Reporter: Danny Chen
Assignee: Sagar Sumit
Priority: Critical
Fix For: 0.14.1

See the issue for details: [[SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables. · Issue #7643 · apache/hudi (github.com)|https://github.com/apache/hudi/issues/7643]
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4123:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4123
URL: https://issues.apache.org/jira/browse/HUDI-4123
Project: Apache Hudi
Issue Type: Bug
Components: deltastreamer
Reporter: 董可伦
Assignee: 董可伦
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

When using SqlSource:

## Create hive source table

```sql
create database test location '/test';
create table test.test_source (
  id int,
  name string,
  price double,
  dt string,
  ts bigint
);
insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
```

## Use SqlSource

sql_source.properties

```
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=dt
hoodie.deltastreamer.source.sql.sql.query=select * from test.test_source
hoodie.datasource.hive_sync.table=test_hudi_target
hoodie.datasource.hive_sync.database=hudi
hoodie.datasource.hive_sync.partition_fields=dt
hoodie.datasource.hive_sync.create_managed_table=true
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
```

```bash
spark-submit --conf "spark.sql.catalogImplementation=hive" \
  --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 --executor-cores 2 --driver-memory 4G --driver-cores 2 \
  --principal spark/indata-10-110-105-163.indata@indata.com --keytab /etc/security/keytabs/spark.service.keytab \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
  --props file:///opt/sql_source.properties \
  --target-base-path /hudi/test_hudi_target \
  --target-table test_hudi_target \
  --op BULK_INSERT \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.SqlSource \
  --enable-sync \
  --checkpoint earliest \
  --allow-commit-on-no-checkpoint-change
```

On the first execution, the hive source table is successfully written to the Hudi target table. However, if it is executed multiple times, for example on the second run, an exception is thrown:

```
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer. Last Commit : ...
"deltastreamer.checkpoint.reset_key" : "earliest"
```

The reason is that the value of `deltastreamer.checkpoint.reset_key` is `earliest`, but `deltastreamer.checkpoint.key` is null. According to the logic of the method `getCheckpointToResume`, this combination throws the exception. I think that since the value of `deltastreamer.checkpoint.key` is null, the value of `deltastreamer.checkpoint.reset_key` should also be saved as null; this avoids the exception according to the logic of `getCheckpointToResume`.

```
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer.
Last Commit :Option{val=[20220519162403646__commit__COMPLETED]}, Instants :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
  "partitionToWriteStats" : {
    "2016/03/15" : [
      { "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0", "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet", "prevCommit" : "null", "numWrites" : 342, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 342, "totalWriteBytes" : 481336, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "2016/03/15", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 481336, "minEventTime" : null, "maxEventTime" : null }
    ],
    "2015/03/16" : [
      { "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0", "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet", "prevCommit" : "null",
```
[jira] [Updated] (HUDI-5490) Investigate test failures w/ record level index for existing tests
[ https://issues.apache.org/jira/browse/HUDI-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5490:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5490
URL: https://issues.apache.org/jira/browse/HUDI-5490
Project: Apache Hudi
Issue Type: Improvement
Components: metadata
Reporter: sivabalan narayanan
Assignee: Lokesh Jain
Priority: Blocker
Fix For: 0.14.1

Enable record level index for some of the chosen tests (30 to 40) and ensure they succeed. The parameterized tests covered in this jira are: TestCOWDataSourceStorage, TestSparkDataSource, TestMORDataSourceStorage, TestCOWDataSource#testDropInsertDup, and TestHoodieClientOnCopyOnWriteStorage (sub-tests below for TestHoodieClientOnCopyOnWriteStorage):
* Auto commit tests
* testDeduplicationOnInsert
* testDeduplicationOnUpsert
* testInsertsWithHoodieConcatHandle
* testDeletes
* testUpsertsUpdatePartitionPathGlobalBloom
* testSmallInsertHandlingForUpserts
* testSmallInsertHandlingForInserts
* testDeletesWithDeleteApi
[jira] [Updated] (HUDI-3026) HoodieAppendhandle may result in duplicate key for hbase index
[ https://issues.apache.org/jira/browse/HUDI-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3026:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3026
URL: https://issues.apache.org/jira/browse/HUDI-3026
Project: Apache Hudi
Issue Type: Bug
Reporter: ZiyueGuan
Assignee: ZiyueGuan
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1
Original Estimate: 1h
Remaining Estimate: 1h

Problem: the same key may occur in two file groups when the HBase index is used. These two file groups will have the same fileID prefix. As the HBase index is global, this is unexpected.

How to reproduce: take a table whose records are not sorted in Spark, say five records with keys 1,2,3,4,5 to write; they may be iterated in a different order per attempt. In the first task attempt (attempt 1), we write three records 5,4,3 to fileID_1_log.1_attempt1, but this attempt fails. Spark retries in a second task attempt (attempt 2), which writes four records 1,2,3,4 to fileID_1_log.1_attempt2. At that point canWrite reports the file group is large enough, so Hudi writes record 5 to fileID_2_log.1_attempt2 and finishes the commit.

When compaction runs, fileID_1_log.1_attempt1 and fileID_1_log.1_attempt2 are both compacted, so we finally get 543 + 1234 = 12345 in fileID_1 while we also have 5 in fileID_2. Record 5 appears in two file groups.

Reason: the marker file mechanism doesn't reconcile log files, as the code shows in [https://github.com/apache/hudi/blob/9a2030ab3190acf600ce4820be9a08929595763e/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L553], and log files are actually not fail-safe. I'm not sure if [~danny0405] has found this problem too, as I see FlinkAppendHandle had been made to always return true, but it was changed back recently.

Solution: a quick fix is to make canWrite in HoodieAppendHandle always return true. However, there may be a more elegant solution: use the append result to generate the compaction plan rather than listing log files, which gives more granular control at the log-block level instead of the log-file level.
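A sketch of the "quick fix" above, as an illustrative stand-in rather than the real HoodieAppendHandle:

{code:java}
// Illustrative sketch only. Never reporting the handle as full means a retried
// task attempt cannot spill its remaining records into a second file group.
class AlwaysWritableAppendHandle {
  public boolean canWrite() {
    // Rolling over to a new file group mid-attempt is exactly what lets the
    // same record key land in two file groups after a failed first attempt.
    return true;
  }
}
{code}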
[jira] [Updated] (HUDI-4701) Support bulk insert without primary key and precombine field
[ https://issues.apache.org/jira/browse/HUDI-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4701:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4701
URL: https://issues.apache.org/jira/browse/HUDI-4701
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Assignee: Lokesh Jain
Priority: Critical
Fix For: 0.14.1
[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector
[ https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4585:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4585
URL: https://issues.apache.org/jira/browse/HUDI-4585
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Blocker
Fix For: 0.14.1
[jira] [Updated] (HUDI-6163) Add PR size labeler to Hudi repo
[ https://issues.apache.org/jira/browse/HUDI-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6163:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6163
URL: https://issues.apache.org/jira/browse/HUDI-6163
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Minor
Labels: pull-request-available
Fix For: 0.14.1
[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation
[ https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4294:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4294
URL: https://issues.apache.org/jira/browse/HUDI-4294
Project: Apache Hudi
Issue Type: New Feature
Reporter: shibei
Assignee: shibei
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

In this issue, we introduce a new action type called build to actually perform index data generation. This action contains two steps, as the clustering action does:
# Generate an action plan to clarify which files and which indexes need to be built;
# Execute the index build according to the action plan generated by step one.

A call procedure will be implemented as well.
[jira] [Updated] (HUDI-6139) Add support for Transformer schema validation in DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6139:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6139
URL: https://issues.apache.org/jira/browse/HUDI-6139
Project: Apache Hudi
Issue Type: Bug
Components: deltastreamer
Reporter: Lokesh Jain
Assignee: Lokesh Jain
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

Add a new API in Transformer to provide the target schema after transformation. The new API can then be used to validate whether the schema of the transformed data matches the expected schema of the transformer.
[jira] [Updated] (HUDI-5323) Decouple virtual key with writing bloom filters to parquet files
[ https://issues.apache.org/jira/browse/HUDI-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5323:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5323
URL: https://issues.apache.org/jira/browse/HUDI-5323
Project: Apache Hudi
Issue Type: Improvement
Components: index, writer-core
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

When the virtual key feature is enabled by setting hoodie.populate.meta.fields to false, the bloom filters are not written to parquet base files in the write transactions. Relevant logic in the HoodieFileWriterFactory class (generic type parameters were lost in transit and are reconstructed here):

{code:java}
private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
    String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
    TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
  return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
      taskContextSupplier, populateMetaFields, populateMetaFields);
}

private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
    String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
    TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
  Option<BloomFilter> filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
  HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);
  HoodieParquetConfig<HoodieAvroWriteSupport> parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
      config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
      conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());
  return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
}
{code}

Given that bloom filters are absent, when using the Bloom Index on the same table, the writer encounters an NPE (HUDI-5319). We should decouple the virtual key feature from the bloom filter and always write the bloom filters to the parquet files.
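A minimal sketch of the proposed decoupling, assuming the signatures above: the delegating overload would stop reusing populateMetaFields as the bloom-filter switch.

{code:java}
// Sketch only: pass an explicit `true` (or a dedicated config check) for
// enableBloomFilter instead of reusing populateMetaFields for both purposes.
return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
    taskContextSupplier, populateMetaFields, /* enableBloomFilter */ true);
{code}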
[jira] [Updated] (HUDI-2754) Performance improvement for IncrementalRelation
[ https://issues.apache.org/jira/browse/HUDI-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2754:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-2754
URL: https://issues.apache.org/jira/browse/HUDI-2754
Project: Apache Hudi
Issue Type: Improvement
Components: incremental-query, performance
Reporter: Jintao
Assignee: Jintao
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

When HoodieIncrSource is used to fetch updates from another Hudi table, the IncrementalRelation is used to read the data. It has a performance issue because column pruning and predicate pushdown don't happen; as a result, Hudi reads too much useless data. By enabling column pruning and predicate pushdown, the data to read is reduced dramatically.
[jira] [Updated] (HUDI-541) Replace variables/comments named "data files" to "base file"
[ https://issues.apache.org/jira/browse/HUDI-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-541:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-541
URL: https://issues.apache.org/jira/browse/HUDI-541
Project: Apache Hudi
Issue Type: Improvement
Components: code-quality, dev-experience
Reporter: Vinoth Chandar
Assignee: Pratyaksh Sharma
Priority: Major
Labels: new-to-hudi, pull-request-available
Fix For: 0.14.1

Per the cWiki design and arch page, we should converge on the same terminology. We have _HoodieBaseFile_; we should ensure all variables of this type are named _baseFile_ or _bf_, as opposed to _dataFile_ or _df_.
[jira] [Updated] (HUDI-3541) Array of struct or Struct of Array AvroConversion issue
[ https://issues.apache.org/jira/browse/HUDI-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3541:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3541
URL: https://issues.apache.org/jira/browse/HUDI-3541
Project: Apache Hudi
Issue Type: Task
Components: writer-core
Reporter: sivabalan narayanan
Assignee: sivabalan narayanan
Priority: Critical
Fix For: 0.14.1
[jira] [Updated] (HUDI-4954) Shade avro in all bundles where it is included
[ https://issues.apache.org/jira/browse/HUDI-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4954:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4954
URL: https://issues.apache.org/jira/browse/HUDI-4954
Project: Apache Hudi
Issue Type: Task
Components: dependencies
Reporter: Sagar Sumit
Assignee: Sagar Sumit
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

[https://github.com/apache/hudi/issues/6829]

Shading avro in some but not all bundles leads to class conflicts if those bundles are on the classpath.
[jira] [Updated] (HUDI-96) Use Command line options instead of positional arguments when launching spark applications from various CLI commands
[ https://issues.apache.org/jira/browse/HUDI-96?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-96:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-96
URL: https://issues.apache.org/jira/browse/HUDI-96
Project: Apache Hudi
Issue Type: Task
Components: cli
Reporter: Balaji Varadarajan
Assignee: Pratyaksh Sharma
Priority: Major
Labels: new-to-hudi, newbie, pull-request-available, sev:normal
Fix For: 0.14.1
Time Spent: 20m
Remaining Estimate: 0h

Hoodie CLI commands like compaction/rollback/repair/savepoints/parquet-import rely on launching a spark application to perform their operations (look at SparkMain.java). SparkMain (look at SparkMain.main()) relies on positional arguments for passing the various CLI options. Instead, we should define proper CLI options in SparkMain and use them (via JCommander) to improve readability and avoid accidental errors at call sites. For an example, see com.uber.hoodie.utilities.HoodieCompactor.
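A small illustration of the JCommander style proposed above; the option names and class here are hypothetical, not the actual SparkMain options:

{code:java}
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;

// Named options instead of positional arguments: reordering or omitting an
// argument becomes a parse error instead of a silent mix-up.
public class SparkMainArgsSketch {
  @Parameter(names = {"--command"}, description = "CLI command to run", required = true)
  String command;

  @Parameter(names = {"--base-path"}, description = "Base path of the Hudi table", required = true)
  String basePath;

  public static void main(String[] args) {
    SparkMainArgsSketch parsed = new SparkMainArgsSketch();
    JCommander.newBuilder().addObject(parsed).build().parse(args);
    System.out.println("Running " + parsed.command + " on " + parsed.basePath);
  }
}
{code}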
[jira] [Updated] (HUDI-3676) Enhance tests for triggering clean every Nth commit
[ https://issues.apache.org/jira/browse/HUDI-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3676:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3676
URL: https://issues.apache.org/jira/browse/HUDI-3676
Project: Apache Hudi
Issue Type: Test
Components: cleaning, tests-ci
Reporter: sivabalan narayanan
Assignee: Pratyaksh Sharma
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

[PR-4385|https://github.com/apache/hudi/pull/4385]

We need to enhance tests for this new feature, i.e. triggering clean every Nth commit.
[jira] [Updated] (HUDI-5423) Flaky test: ColumnStatsTestCase(MERGE_ON_READ,true,true)
[ https://issues.apache.org/jira/browse/HUDI-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5423:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5423
URL: https://issues.apache.org/jira/browse/HUDI-5423
Project: Apache Hudi
Issue Type: Test
Components: tests-ci
Reporter: Raymond Xu
Assignee: Alexey Kudinkin
Priority: Blocker
Fix For: 0.14.1

{code}
[ERROR] Tests run: 94, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 1,729.267 s <<< FAILURE! - in JUnit Vintage
[ERROR] [8] ColumnStatsTestCase(MERGE_ON_READ,true,true)(testMetadataColumnStatsIndex(ColumnStatsTestCase))  Time elapsed: 23.246 s  <<< FAILURE!
org.opentest4j.AssertionFailedError:
expected:
<{"c1_maxValue":101,"c1_minValue":101,"c1_nullCount":0,"c2_maxValue":" 999sdc","c2_minValue":" 999sdc","c2_nullCount":0,"c3_maxValue":10.329,"c3_minValue":10.329,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.179Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":99,"c5_minValue":99,"c5_nullCount":0,"c6_maxValue":"2020-03-28","c6_minValue":"2020-03-28","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"SA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
{"c1_maxValue":562,"c1_minValue":323,"c1_nullCount":0,"c2_maxValue":" 984sdc","c2_minValue":" 980sdc","c2_nullCount":0,"c3_maxValue":977.328,"c3_minValue":64.768,"c3_nullCount":1,"c4_maxValue":"2021-11-19T07:34:44.201Z","c4_minValue":"2021-11-19T07:34:44.181Z","c4_nullCount":0,"c5_maxValue":78,"c5_minValue":34,"c5_nullCount":0,"c6_maxValue":"2020-10-21","c6_minValue":"2020-01-15","c6_nullCount":0,"c7_maxValue":"SA==","c7_minValue":"qw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":4}
{"c1_maxValue":568,"c1_minValue":8,"c1_nullCount":0,"c2_maxValue":" 8sdc","c2_minValue":" 111sdc","c2_nullCount":0,"c3_maxValue":979.272,"c3_minValue":82.111,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.193Z","c4_minValue":"2021-11-19T07:34:44.159Z","c4_nullCount":0,"c5_maxValue":58,"c5_minValue":2,"c5_nullCount":0,"c6_maxValue":"2020-11-08","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"9g==","c7_minValue":"Ag==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":15}
{"c1_maxValue":619,"c1_minValue":619,"c1_nullCount":0,"c2_maxValue":" 985sdc","c2_minValue":" 985sdc","c2_nullCount":0,"c3_maxValue":230.320,"c3_minValue":230.320,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":33,"c5_nullCount":0,"c6_maxValue":"2020-02-13","c6_minValue":"2020-02-13","c6_nullCount":0,"c7_maxValue":"QA==","c7_minValue":"QA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":1}
{"c1_maxValue":633,"c1_minValue":624,"c1_nullCount":0,"c2_maxValue":" 987sdc","c2_minValue":" 986sdc","c2_nullCount":0,"c3_maxValue":580.317,"c3_minValue":375.308,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.180Z","c4_minValue":"2021-11-19T07:34:44.180Z","c4_nullCount":0,"c5_maxValue":33,"c5_minValue":32,"c5_nullCount":0,"c6_maxValue":"2020-10-10","c6_minValue":"2020-01-01","c6_nullCount":0,"c7_maxValue":"PQ==","c7_minValue":"NA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":2}
{"c1_maxValue":639,"c1_minValue":555,"c1_nullCount":0,"c2_maxValue":" 989sdc","c2_minValue":" 982sdc","c2_nullCount":0,"c3_maxValue":904.304,"c3_minValue":153.431,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.186Z","c4_minValue":"2021-11-19T07:34:44.179Z","c4_nullCount":0,"c5_maxValue":44,"c5_minValue":31,"c5_nullCount":0,"c6_maxValue":"2020-08-25","c6_minValue":"2020-03-12","c6_nullCount":0,"c7_maxValue":"MA==","c7_minValue":"rw==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":3}
{"c1_maxValue":715,"c1_minValue":76,"c1_nullCount":0,"c2_maxValue":" 76sdc","c2_minValue":" 224sdc","c2_nullCount":0,"c3_maxValue":958.579,"c3_minValue":246.427,"c3_nullCount":0,"c4_maxValue":"2021-11-19T07:34:44.199Z","c4_minValue":"2021-11-19T07:34:44.166Z","c4_nullCount":0,"c5_maxValue":73,"c5_minValue":9,"c5_nullCount":0,"c6_maxValue":"2020-11-21","c6_minValue":"2020-01-16","c6_nullCount":0,"c7_maxValue":"+g==","c7_minValue":"LA==","c7_nullCount":0,"c8_maxValue":9,"c8_minValue":9,"c8_nullCount":0,"valueCount":12}
{"c1_maxValue":768,"c1_minValue":59,"c1_nullCount":0,"c2_maxValue":" 768sdc","c2_minValue":"
{code}
[jira] [Updated] (HUDI-3617) MOR compact improve
[ https://issues.apache.org/jira/browse/HUDI-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3617:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3617
URL: https://issues.apache.org/jira/browse/HUDI-3617
Project: Apache Hudi
Issue Type: Improvement
Components: compaction, writer-core
Reporter: scx
Assignee: scx
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

In most business scenarios, the latest data is in the latest delta log file. By sorting the log files from largest to smallest instant time, we can largely avoid rewriting data during compaction and thus reduce compaction time.
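A minimal sketch of the ordering idea above (hypothetical helper, not the actual compaction code): scan delta log files from the newest instant time to the oldest, so the latest version of each key is seen first and older copies need not be rewritten.

{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LogFileOrderingSketch {
  static List<String> newestFirst(List<String> logFileInstantTimes) {
    // Instant times are fixed-width timestamps, so reverse lexicographic
    // order puts the newest delta log first.
    logFileInstantTimes.sort(Comparator.reverseOrder());
    return logFileInstantTimes;
  }

  public static void main(String[] args) {
    System.out.println(newestFirst(new java.util.ArrayList<>(
        Arrays.asList("20220301", "20220303", "20220302")))); // [20220303, 20220302, 20220301]
  }
}
{code}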
[jira] [Updated] (HUDI-6317) Streaming read should skip compaction and clustering instants to avoid duplicates
[ https://issues.apache.org/jira/browse/HUDI-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6317:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6317
URL: https://issues.apache.org/jira/browse/HUDI-6317
Project: Apache Hudi
Issue Type: Bug
Components: flink
Reporter: Nicholas Jiang
Assignee: Nicholas Jiang
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

At present, the default value of read.streaming.skip_clustering is false, so a streaming read can consume the replaced file slices of a clustering: when the data of day T-1 is clustered, the streaming read may read that T-1 data again, causing duplicates. Streaming read should therefore skip clustering instants in all cases to avoid reading replaced file slices. The same applies to `read.streaming.skip_compaction`. See the usage sketch below.
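A usage sketch of the options discussed above (the option keys come from this issue; the table schema and path are placeholders):

{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingReadSkipSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // A streaming-read Hudi source that skips compaction and clustering
    // instants, avoiding re-reads of replaced file slices.
    tEnv.executeSql(
        "CREATE TABLE hudi_src (id INT, ts TIMESTAMP(3)) WITH (\n"
            + "  'connector' = 'hudi',\n"
            + "  'path' = 'file:///tmp/hudi_table',\n"
            + "  'read.streaming.enabled' = 'true',\n"
            + "  'read.streaming.skip_compaction' = 'true',\n"
            + "  'read.streaming.skip_clustering' = 'true'\n"
            + ")");
  }
}
{code}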
[jira] [Updated] (HUDI-1628) [Umbrella] Improve data locality during ingestion
[ https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1628:
Fix Version/s: 0.14.1 (was: 0.14.0) (was: 1.1.0)

Key: HUDI-1628
URL: https://issues.apache.org/jira/browse/HUDI-1628
Project: Apache Hudi
Issue Type: Epic
Components: writer-core
Reporter: satish
Assignee: Ethan Guo
Priority: Major
Labels: hudi-umbrellas
Fix For: 0.14.1

Today the upsert partitioner does the file sizing/bin-packing etc. for inserts and then sends some inserts over to existing file groups to maintain file size. We can abstract all of this into strategies and some kind of pipeline abstractions, and have it also consider "affinity" to an existing file group based on, say, information stored in the metadata table.

See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser for more details.
[jira] [Updated] (HUDI-3555) re-use spark config for parquet timestamp format instead of having our own config
[ https://issues.apache.org/jira/browse/HUDI-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3555:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3555
URL: https://issues.apache.org/jira/browse/HUDI-3555
Project: Apache Hudi
Issue Type: Improvement
Components: spark
Reporter: sivabalan narayanan
Assignee: liujinhui
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

We have two different configs to set the right timestamp format:

"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MICROS"

and the spark config

--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS

We should deprecate our own config and just rely on Spark's config.
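A usage sketch of the preferred direction: setting Spark's own parquet timestamp config (a real Spark SQL setting) instead of the Hudi-specific duplicate.

{code:java}
import org.apache.spark.sql.SparkSession;

public class TimestampConfigExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        // Spark's native setting; no Hudi-specific duplicate required.
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
        .getOrCreate();
    System.out.println(spark.conf().get("spark.sql.parquet.outputTimestampType"));
    spark.stop();
  }
}
{code}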
[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap
[ https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4854:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4854
URL: https://issues.apache.org/jira/browse/HUDI-4854
Project: Apache Hudi
Issue Type: Bug
Components: bootstrap
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Major
Fix For: 0.14.1
[jira] [Updated] (HUDI-6105) Partial update for MERGE INTO
[ https://issues.apache.org/jira/browse/HUDI-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6105:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6105
URL: https://issues.apache.org/jira/browse/HUDI-6105
Project: Apache Hudi
Issue Type: New Feature
Components: spark-sql
Reporter: Danny Chen
Assignee: Jing Zhang
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1
[jira] [Updated] (HUDI-1864) Support for java.time.LocalDate in TimestampBasedAvroKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1864:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-1864
URL: https://issues.apache.org/jira/browse/HUDI-1864
Project: Apache Hudi
Issue Type: Improvement
Reporter: Vaibhav Sinha
Assignee: sivabalan narayanan
Priority: Major
Labels: pull-request-available, query-eng, sev:high
Fix For: 0.14.1

When we read data from MySQL which has a column of type {{Date}}, Spark represents it as an instance of {{java.time.LocalDate}}. If I try to use this column for partitioning while doing a write to Hudi, I get the following exception:

{code:java}
Caused by: org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input partition field :2021-04-21
    at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:136) ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
    at org.apache.hudi.keygen.CustomAvroKeyGenerator.getPartitionPath(CustomAvroKeyGenerator.java:89) ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
    at org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:64) ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
    at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62) ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
    at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$2(HoodieSparkSqlWriter.scala:160) ~[hudi-spark3-bundle_2.12-0.8.0.jar:0.8.0]
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) ~[scala-library-2.12.10.jar:?]
    at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271) ~[scala-library-2.12.10.jar:?]
    at scala.collection.Iterator.foreach(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
    at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) ~[scala-library-2.12.10.jar:?]
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library-2.12.10.jar:?]
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) ~[scala-library-2.12.10.jar:?]
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) ~[scala-library-2.12.10.jar:?]
    at scala.collection.AbstractIterator.to(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) ~[scala-library-2.12.10.jar:?]
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) ~[scala-library-2.12.10.jar:?]
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) ~[scala-library-2.12.10.jar:?]
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
    at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.scheduler.Task.run(Task.scala:131) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) ~[spark-core_2.12-3.1.1.jar:3.1.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748)
{code}
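An illustrative sketch of one way to support this (hypothetical, not the actual TimestampBasedAvroKeyGenerator fix): convert a java.time.LocalDate partition value to epoch millis before the timestamp-based partition-path logic runs, instead of failing to parse it.

{code:java}
import java.time.LocalDate;
import java.util.concurrent.TimeUnit;

public class LocalDateSupportSketch {
  static long toEpochMillis(Object partitionVal) {
    if (partitionVal instanceof LocalDate) {
      // LocalDate carries no time-of-day; treat it as midnight UTC.
      return TimeUnit.DAYS.toMillis(((LocalDate) partitionVal).toEpochDay());
    }
    throw new IllegalArgumentException("Unable to parse input partition field: " + partitionVal);
  }

  public static void main(String[] args) {
    System.out.println(toEpochMillis(LocalDate.parse("2021-04-21"))); // 1618963200000
  }
}
{code}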
[jira] [Updated] (HUDI-5464) Fix instantiation of a new partition in MDT re-using the same instant time as a regular commit
[ https://issues.apache.org/jira/browse/HUDI-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5464:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5464
URL: https://issues.apache.org/jira/browse/HUDI-5464
Project: Apache Hudi
Issue Type: Bug
Components: metadata
Reporter: sivabalan narayanan
Assignee: Raymond Xu
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

We re-use the same instant time as the commit being applied to the MDT while instantiating a new partition in the MDT. This needs to be fixed.

For example, say we have 10 commits with FILES already enabled, and with C11 we are enabling col-stats. After the data table work, when we enter metadata writer instantiation, we detect that col-stats has to be instantiated and instantiate it using DC11: in the MDT timeline we see dc11.req, dc11.inflight and dc11.complete. Then we go ahead and apply the actual C11 from the DT to the MDT (dc11.inflight and dc11.complete are updated). Here we overwrite the same DC11 with records pertaining to C11, which is buggy; we definitely need to fix this.

We can add a suffix to C11 (say C11_003 or C11_001), as we do for compaction and clean in the MDT, so that any additional operation in the MDT has a different commit time format. Everything else should match the DT one-to-one.

Impact: we are overwriting the same DC for two purposes, which is bad. If there is a crash after initializing col-stats and before applying the actual C11 (in the above context), we might mistakenly roll back the col-stats initialization while the table config still says col-stats is fully ready to be served; yet when reading the MDT we would not read DC11, since it is a failed commit.
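A minimal sketch of the suffixing idea above (the constant and helper are hypothetical): MDT-internal operations get a suffixed instant time so they can never collide with the data-table commit they accompany.

{code:java}
class MdtInstantTimeSketch {
  // Illustrative suffix, mirroring how compaction/clean already suffix
  // their MDT instants; the exact value is a design choice.
  static final String PARTITION_INIT_SUFFIX = "001"; // e.g. C11 -> C11001

  static String partitionInitInstant(String dataTableInstantTime) {
    return dataTableInstantTime + PARTITION_INIT_SUFFIX;
  }
}
{code}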
[jira] [Updated] (HUDI-6126) Fix test testInsertDatasetWIthTimelineTimezoneUTC to not block CI
[ https://issues.apache.org/jira/browse/HUDI-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6126:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6126
URL: https://issues.apache.org/jira/browse/HUDI-6126
Project: Apache Hudi
Issue Type: Bug
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Blocker
Labels: pull-request-available
Fix For: 0.14.1
[jira] [Updated] (HUDI-5997) Support DFS Schema Provider with S3/GCS EventsHoodieIncrSource
[ https://issues.apache.org/jira/browse/HUDI-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5997:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5997
URL: https://issues.apache.org/jira/browse/HUDI-5997
Project: Apache Hudi
Issue Type: Improvement
Components: deltastreamer
Reporter: Sagar Sumit
Assignee: Léo Biscassi
Priority: Major
Fix For: 0.14.1

See for more details
[jira] [Updated] (HUDI-2681) Make hoodie record_key and preCombine_key optional
[ https://issues.apache.org/jira/browse/HUDI-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2681:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-2681
URL: https://issues.apache.org/jira/browse/HUDI-2681
Project: Apache Hudi
Issue Type: New Feature
Components: Common Core, spark-sql, writer-core
Reporter: Vinoth Govindarajan
Assignee: sivabalan narayanan
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

At present, Hudi needs a record key and a preCombine key to create a Hudi dataset, which restricts the kinds of datasets we can create using Hudi.

In order to increase the adoption of the Hudi file format across all kinds of derived datasets, similar to Parquet/ORC, we need to offer flexibility to users. The record key is used for the upsert primitive and the preCombine key breaks ties for deduplication, but there are event data and other datasets without any primary key (append-only datasets) which can benefit from Hudi, since the Hudi ecosystem offers other features such as snapshot isolation, indexes, clustering, and the delta streamer that apply to datasets without a record key.

The idea of this proposal is to make both the record key and the preCombine key optional to allow a variety of new use cases on top of Hudi.
[jira] [Updated] (HUDI-1120) Support spotless for scala
[ https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-1120:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-1120
URL: https://issues.apache.org/jira/browse/HUDI-1120
Project: Apache Hudi
Issue Type: Improvement
Components: code-quality
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li
Priority: Major
Labels: pull-request-available, sev:normal, user-support-issues
Fix For: 0.14.1
[jira] [Updated] (HUDI-2151) Make performant out-of-box configs
[ https://issues.apache.org/jira/browse/HUDI-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2151:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-2151
URL: https://issues.apache.org/jira/browse/HUDI-2151
Project: Apache Hudi
Issue Type: Improvement
Components: code-quality, docs, writer-core
Reporter: Vinoth Chandar
Assignee: sivabalan narayanan
Priority: Blocker
Labels: pull-request-available
Fix For: 0.14.1
Original Estimate: 2h
Remaining Estimate: 2h

We have quite a few configs which deliver better performance or usability but are guarded by flags. This task is to identify them, change them, test them (functionally and for performance), and make them the default.

We also need to capture all the backwards compatibility issues that can arise.
[jira] [Updated] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs
[ https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6112:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6112
URL: https://issues.apache.org/jira/browse/HUDI-6112
Project: Apache Hudi
Issue Type: Task
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

The HoodieConfigDocGenerator needs to be modified so that:
* Each config group has two sections: basic configs and advanced configs.
* Basic configs and advanced configs are laid out in a table instead of serially as today.
* Within each of these tables, the required configs are bubbled up to the top and highlighted.

Add UI fixes to support a table layout.
[jira] [Updated] (HUDI-6075) Improve config generation script and docs
[ https://issues.apache.org/jira/browse/HUDI-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6075:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6075
URL: https://issues.apache.org/jira/browse/HUDI-6075
Project: Apache Hudi
Issue Type: New Feature
Components: configs
Reporter: Ethan Guo
Assignee: Ethan Guo
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1
[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks
[ https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-6116:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-6116
URL: https://issues.apache.org/jira/browse/HUDI-6116
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1

The code currently does an eager is-corrupted check, for which we do a seek and then a read. This invalidates the internal buffers of the opened file stream to the log file and makes a call to the DataNode to start a new blockReader.

The seek + read becomes apparent when we do cross-datacenter reads or where the latency to the file is high: a single RPC (west coast to east coast) costs about 120ms plus the cost of the RPC itself, so this seek is bad for performance.

Delaying the corrupt check also gives many benefits in low-latency environments, where we see times reduced from 5-8 sec to between 3s and under 500ms for moderately sized files of 250MB.

NOTE: the more log blocks there are to read, the greater the performance improvement.
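An illustrative sketch of the lazy check (hypothetical block format, not the actual HoodieLogFileReader): parse blocks optimistically in a single forward pass and only pay for corruption handling when a header fails to parse, instead of seeking ahead eagerly to validate every block.

{code:java}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class LazyLogBlockReaderSketch {
  private static final int MAGIC = 0x48554449; // "HUDI" as bytes; illustrative

  private final DataInputStream in;

  LazyLogBlockReaderSketch(DataInputStream in) {
    this.in = in;
  }

  /** Returns the next block payload, or null at a clean end of file. */
  byte[] readNextBlock() throws IOException {
    try {
      if (in.readInt() != MAGIC) {
        // Only now do we handle corruption; the happy path above never issued
        // the extra seek + read that invalidates the stream's buffers.
        throw new IOException("corrupt log block: bad magic");
      }
      byte[] payload = new byte[in.readInt()]; // length-prefixed payload
      in.readFully(payload);                   // still one sequential pass
      return payload;
    } catch (EOFException eof) {
      return null;
    }
  }
}
{code}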
[jira] [Updated] (HUDI-4569) [RFC-59] Multiple event_time fields latest verification in a single table
[ https://issues.apache.org/jira/browse/HUDI-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4569:
Fix Version/s: 0.14.1 (was: 0.14.0) (was: 1.1.0)

Key: HUDI-4569
URL: https://issues.apache.org/jira/browse/HUDI-4569
Project: Apache Hudi
Issue Type: Epic
Reporter: Richard Xinyao Tian
Assignee: Richard Xinyao Tian
Priority: Major
Labels: features, pull-request-available
Fix For: 0.14.1

This Jira tracks all sub-tasks related to RFC-59, which would give Hudi a new feature highly demanded by finance-related industries, tentatively named "Multiple event_time fields latest verification in a single table". This feature gives Hudi the ability to verify multiple event_time fields as the latest, enabling Hudi to support scenarios where complex join operations have been executed.

We're very keen to make this new feature available to everyone. Since we benefit from the Hudi community, we really desire to give back to the community with our efforts.

For those who are interested in why this feature is desired, please see the RFC request discussion at the link below: [https://lists.apache.org/thread/dlkgn1knknhl3z2gwtvchd618tj399z9]

Stages Management:
# Briefly illustrate this idea on the dev mailing list and discuss it with committers [Done]
# Raise a PR to claim an RFC number for further steps [Done]
# Write formal RFC materials to describe the design and code implementation [Done]
# Finish writing the RFC materials and raise a PR to submit them [Done]
# Wait for comments from at least two PMC members and confirm approval [Current Stage]
[jira] [Updated] (HUDI-5175) Improving FileIndex load performance in PARALLELISM mode
[ https://issues.apache.org/jira/browse/HUDI-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5175:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5175
URL: https://issues.apache.org/jira/browse/HUDI-5175
Project: Apache Hudi
Issue Type: Improvement
Components: index
Reporter: Yue Zhang
Assignee: Yue Zhang
Priority: Major
Labels: pull-request-available
Fix For: 0.14.1
[jira] [Updated] (HUDI-3068) Add support to sync all partitions in hive sync tool
[ https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3068:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-3068
URL: https://issues.apache.org/jira/browse/HUDI-3068
Project: Apache Hudi
Issue Type: New Feature
Components: meta-sync
Reporter: sivabalan narayanan
Assignee: Harshal Patil
Priority: Blocker
Labels: pull-request-available, sev:critical
Fix For: 0.14.1

If a user runs hive sync only occasionally, and archival kicked in and trimmed some commits, and partitions were added during those commits that were never updated later, hive sync will miss those partitions.

{code:java}
LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", Getting commits since then");
return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
    .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
{code}

This is because, for recurrent syncs, we always fetch new commits from the timeline after the last synced instant, fetch their commit metadata, and from that derive the partitions that were added.

We can add a new config to the hive sync tool to override this behavior: --sync-all-partitions. When this config is set to true, we should ignore the last synced instant and take the route below, which is used when syncing for the first time (see the combined sketch after this entry):

{code:java}
if (!lastCommitTimeSynced.isPresent()) {
  LOG.info("Last commit time synced is not known, listing all partitions in " + basePath + ",FS :" + fs);
  HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
  return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
}
{code}

Ref issue: https://github.com/apache/hudi/issues/3890
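A sketch of how the proposed flag could combine the two code paths quoted above; the flag name comes from this issue, while the enclosing method and its fields are illustrative:

{code:java}
// Illustrative fragment: --sync-all-partitions simply forces the
// "first sync" route that lists every partition.
List<String> getPartitionsToSync(Option<String> lastCommitTimeSynced, boolean syncAllPartitions) {
  if (syncAllPartitions || !lastCommitTimeSynced.isPresent()) {
    HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
    return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
  }
  return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
      .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
}
{code}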
[jira] [Updated] (HUDI-5131) Bundle validation: upgrade/downgrade
[ https://issues.apache.org/jira/browse/HUDI-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5131:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5131
URL: https://issues.apache.org/jira/browse/HUDI-5131
Project: Apache Hudi
Issue Type: Test
Components: tests-ci
Reporter: Raymond Xu
Assignee: Raymond Xu
Priority: Blocker
Fix For: 0.14.1
[jira] [Updated] (HUDI-2517) Simplify the amount of configs that need to be passed in for Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2517:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-2517
URL: https://issues.apache.org/jira/browse/HUDI-2517
Project: Apache Hudi
Issue Type: Improvement
Components: code-quality, configs
Reporter: Vinoth Chandar
Assignee: Sagar Sumit
Priority: Major
Fix For: 0.14.1
[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers
[ https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-4937:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-4937
URL: https://issues.apache.org/jira/browse/HUDI-4937
Project: Apache Hudi
Issue Type: Bug
Components: reader-core, writer-core
Affects Versions: 0.12.0
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up not to reuse the actual LogScanner and HFileReader used to read the MT itself.

This is proving to be wasteful on a number of occasions already, including (not an exhaustive list):

https://github.com/apache/hudi/issues/6373
[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Integrated Hudi with Snowflake
[ https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-2832:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-2832
URL: https://issues.apache.org/jira/browse/HUDI-2832
Project: Apache Hudi
Issue Type: Epic
Components: Common Core
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
Priority: Critical
Labels: BigQuery, Integration, pull-request-available
Fix For: 0.14.1

Snowflake is a fully managed service that's simple to use but can power a near-unlimited number of concurrent workloads. Snowflake is a solution for data warehousing, data lakes, data engineering, data science, data application development, and securely sharing and consuming shared data.

Snowflake [doesn't support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] the Apache Hudi file format yet, but it has support for the Parquet, ORC, and Delta file formats. This proposal is to implement a SnowflakeSync, similar to HiveSync, that syncs a Hudi table as a Snowflake external Parquet table so that users can query Hudi tables using Snowflake. Many users have expressed interest in this on Hudi and other support channels, asking to integrate Hudi with Snowflake; this will unlock new use cases for Hudi.
[jira] [Updated] (HUDI-5079) Optimize rdd.isEmpty within DeltaSync
[ https://issues.apache.org/jira/browse/HUDI-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-5079:
Fix Version/s: 0.14.1 (was: 0.14.0)

Key: HUDI-5079
URL: https://issues.apache.org/jira/browse/HUDI-5079
Project: Apache Hudi
Issue Type: Improvement
Components: deltastreamer
Reporter: sivabalan narayanan
Assignee: sivabalan narayanan
Priority: Critical
Labels: pull-request-available
Fix For: 0.14.1

We call rdd.isEmpty on the source RDD twice within DeltaSync. We should optimize this and reuse the result, since each isEmpty call triggers a Spark job.
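A sketch of the optimization (illustrative, not the actual DeltaSync code): cache the RDD and evaluate isEmpty() once, then reuse the boolean rather than triggering a separate Spark job per call.

{code:java}
import org.apache.spark.api.java.JavaRDD;

class AvoidDoubleIsEmptySketch {
  static <T> boolean checkOnce(JavaRDD<T> source) {
    JavaRDD<T> cached = source.cache(); // keep the first evaluation around
    boolean empty = cached.isEmpty();   // one job; downstream code reuses `empty`
    return empty;
  }
}
{code}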
[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi
[ https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2687: - Fix Version/s: 0.14.1 (was: 1.0.0) (was: 0.14.0) (was: 0.15.0) > [UMBRELLA] A new Trino connector for Hudi > - > > Key: HUDI-2687 > URL: https://issues.apache.org/jira/browse/HUDI-2687 > Project: Apache Hudi > Issue Type: Epic > Components: trino-presto >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Critical > Labels: hudi-umbrellas > Fix For: 0.14.1 > > Attachments: image-2021-11-05-14-16-57-324.png, > image-2021-11-05-14-17-03-211.png > > > This JIRA tracks all the tasks related to building a new Hudi connector in > Trino. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5092) Querying Hudi table throws NoSuchMethodError in Databricks runtime
[ https://issues.apache.org/jira/browse/HUDI-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5092: - Fix Version/s: 0.14.1 (was: 0.14.0) > Querying Hudi table throws NoSuchMethodError in Databricks runtime > --- > > Key: HUDI-5092 > URL: https://issues.apache.org/jira/browse/HUDI-5092 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.12.0 >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.1 > > Attachments: image (1).png, image.png > > > Originally reported by the user: > [https://github.com/apache/hudi/issues/6137] > > The crux of the issue is that Databricks's DBR runtime diverges from OSS Spark, > and in this case the `FileStatusCache` API is clearly divergent between the two. > There are a few approaches we can take: > # Avoid reliance on Spark's FileStatusCache implementation altogether and > rely on our own > # Apply a more staggered approach where we first try to use Spark's > FileStatusCache and, if it doesn't match the expected API, fall back to our own > impl > > Approach # 1 would mean that we're not sharing the cache implementation > with Spark, which in turn would entail that in some cases we might be keeping 2 > instances of the same cache. Approach # 2 remediates that and allows us to > only fall back in case the API is not compatible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
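A hedged sketch of "approach # 2" above (staggered fallback); every type name here is a hypothetical stand-in, not Spark's or Hudi's actual API. The idea: probe the runtime-provided cache once, and fall back to an in-house implementation when the expected API surfaces as a linkage error.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FallbackCacheSketch {
  /** Hypothetical stand-in for a runtime-provided file-status cache. */
  interface FileStatusCacheLike {
    Object get(String path);
    void put(String path, Object status);
  }

  /** Minimal in-memory fallback used when the runtime API diverges. */
  static final class OwnFileStatusCache implements FileStatusCacheLike {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    public Object get(String path) { return cache.get(path); }
    public void put(String path, Object status) { cache.put(path, status); }
  }

  static FileStatusCacheLike create() {
    try {
      // In the real fix this would probe Spark's FileStatusCache; a divergent
      // runtime (e.g. DBR) surfaces as NoSuchMethodError at first use.
      return probeRuntimeProvidedCache();
    } catch (LinkageError e) {
      return new OwnFileStatusCache(); // staggered fallback to our own impl
    }
  }

  static FileStatusCacheLike probeRuntimeProvidedCache() {
    // Placeholder: simulate a runtime whose API does not match expectations.
    throw new NoSuchMethodError("FileStatusCache.getOrCreate");
  }

  public static void main(String[] args) {
    FileStatusCacheLike cache = create();
    cache.put("s3://bucket/part=1/file.parquet", new Object());
    System.out.println(cache.get("s3://bucket/part=1/file.parquet") != null);
  }
}
{code}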
[jira] [Updated] (HUDI-4287) Optimize Flink checkpoint meta mechanism to fix mistaken pending instants
[ https://issues.apache.org/jira/browse/HUDI-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4287: - Fix Version/s: 0.14.1 (was: 0.14.0) > Optimize Flink checkpoint meta mechanism to fix mistaken pending instants > - > > Key: HUDI-4287 > URL: https://issues.apache.org/jira/browse/HUDI-4287 > Project: Apache Hudi > Issue Type: Bug > Components: flink >Reporter: Shizhi Chen >Assignee: Shizhi Chen >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: image-2022-06-27-19-42-14-676.png, > image-2022-06-27-19-55-20-210.png, image-2022-06-27-20-07-55-984.png, > image-2022-06-27-20-11-47-939.png, image-2022-06-27-20-29-49-897.png > > > *Problem Review* > CkpMetadata is introduced into the flink module to reduce the timeline burden, but currently its > mechanism lacks a corresponding status for rollback instants, which may result > in commit/delta commit instant deletion, and thus > StreamWriteOperatorCoordinator (meta end) and the write function (data end) will > not be coordinated correctly. > Finally, data files will be deleted by mistake. > This situation is easy to reproduce, especially when > StreamWriteOperatorCoordinator schedules table services for a long time > between commit and init instants after the restoration from a checkpoint. > > *Stable Reproduction Procedure* > * a. Before starting a job, let's modify > StreamWriteOperatorCoordinator#notifyCheckpointComplete like: > !image-2022-06-27-19-42-14-676.png|width=479,height=293! > It does nothing but mock the possible long-running table services for fast > reproduction. > * b. Start a simple flink hudi job such as append, and kill it while > the 2nd checkpoint is INFLIGHT. > * c. Restart it from the checkpoint restoration; it is sure to hit > the case after another 2 checkpoints, which may be accompanied by the > FileNotFoundException: > !image-2022-06-27-20-29-49-897.png|width=503,height=386! > More importantly, we can observe the incoordination: > !image-2022-06-27-20-07-55-984.png|width=517,height=109! > The screenshot above shows that the instant should be 20220531163135119 in > 2022-05-31 16:36, which is committed by StreamWriteOperatorCoordinator as the meta > end. > !image-2022-06-27-20-11-47-939.png|width=517,height=155! > At the same time, the data files are written with the wrong base commit > instant: 20220531161923191, which is deleted during rollbacks in procedure c. > because it was incomplete, and which should also have been evicted from ckp_meta. > > *Solution* > The solution is to extend the mechanism with a CANCELLED CkpMessage state of > the highest priority, corresponding to the DELETE of the instant during the rollback action. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3957) Evaluate Support for spark2 and scala12
[ https://issues.apache.org/jira/browse/HUDI-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3957: - Fix Version/s: 0.14.1 (was: 0.14.0) > Evaluate Support for spark2 and scala12 > > > Key: HUDI-3957 > URL: https://issues.apache.org/jira/browse/HUDI-3957 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Minor > Fix For: 0.14.1 > > Attachments: Screen Shot 2022-05-05 at 8.51.11 AM.png, Screen Shot > 2022-05-05 at 8.53.39 AM.png > > Original Estimate: 1h > Remaining Estimate: 1h > > We may need to evaluate the need for supporting Spark 2 and Scala 2.12 and > deprecate them if there is not much usage. > > From the overall stats, hudi-spark_2.12 bundle usage is 2%. Among > hudi-spark2.12 bundle usages, most come from 0.7, 0.8, and 0.9; 0.10 > and above account for ~5% of all hudi-spark2.12 bundle usage. So we can probably > deprecate the usage of Spark 2 and Scala 2.12 going forward and ask users to > use Spark 3. > > !Screen Shot 2022-05-05 at 8.51.11 AM.png! > > > !Screen Shot 2022-05-05 at 8.53.39 AM.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4105) Identify out of the box performance config flips for spark-ds
[ https://issues.apache.org/jira/browse/HUDI-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4105: - Fix Version/s: 0.14.1 (was: 0.14.0) > Identify out of the box performance config flips for spark-ds > - > > Key: HUDI-4105 > URL: https://issues.apache.org/jira/browse/HUDI-4105 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.14.1 > > > We need to identify out-of-the-box performance config flips. Refer to HUDI-2151 for > the older ticket; we need to comb through all configs once again and come up > with an updated list. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4449) Spark: Support DataSourceV2 Read
[ https://issues.apache.org/jira/browse/HUDI-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4449: - Fix Version/s: 0.14.1 (was: 0.14.0) > Spark: Support DataSourceV2 Read > - > > Key: HUDI-4449 > URL: https://issues.apache.org/jira/browse/HUDI-4449 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, spark, spark-sql >Reporter: chenliang >Assignee: chenliang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Introduce the v2 reading interface and define {{HoodieBatchScanBuilder}} to > provide querying capability. Column pruning & filter pushdown are follow-ups. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-956) Test MOR : Presto Realtime Query with metadata bootstrap
[ https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-956: Fix Version/s: 0.14.1 (was: 0.14.0) > Test MOR : Presto Realtime Query with metadata bootstrap > > > Key: HUDI-956 > URL: https://issues.apache.org/jira/browse/HUDI-956 > Project: Apache Hudi > Issue Type: Task > Components: trino-presto >Reporter: Balaji Varadarajan >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1258) Small file handling Merges can be handled without actual merging
[ https://issues.apache.org/jira/browse/HUDI-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1258: - Fix Version/s: 0.14.1 (was: 0.14.0) > Small file handling Merges can be handled without actual merging > > > Key: HUDI-1258 > URL: https://issues.apache.org/jira/browse/HUDI-1258 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Assignee: Pratyaksh Sharma >Priority: Major > Labels: hudi-on-call > Fix For: 0.14.1 > > > If a file slice gets inserts routed into the MergeHandle for file-sizing reasons, there > is no reason to actually build the hashmap and merge. > > This will also avoid the issue of an insert with the same duplicate key > overwriting the previous value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6282) Config conflict with Deltastreamer CustomKeyGenerator for PartitionPath
[ https://issues.apache.org/jira/browse/HUDI-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6282: - Fix Version/s: 0.14.1 (was: 0.14.0) > Config conflict with Deltastreamer CustomKeyGenerator for PartitionPath > --- > > Key: HUDI-6282 > URL: https://issues.apache.org/jira/browse/HUDI-6282 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Aditya Goenka >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > With the Debezium source while using CustomKeyGenerator, the `:Timestamp` suffix > has to be passed for the partition path on the first run, but subsequent runs > work without it. > > Github Issue - [https://github.com/apache/hudi/issues/8372] > > Details to Reproduce the issue- > [https://gist.github.com/ad1happy2go/49b81f015c1a2964fee489214658cf44] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4967: - Fix Version/s: 0.14.1 (was: 0.14.0) > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
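For illustration, a minimal sketch of pinning the extractor explicitly on a Spark write; the paths and table names are placeholders, and the option keys are the standard Hudi hive-sync configs discussed above.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HiveSyncConfigSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hive-sync-sketch").master("local[2]").getOrCreate();
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi/source_table");

    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.partition_fields", "dt")
        // Pin the extractor explicitly if you relied on the pre-0.12 default:
        .option("hoodie.datasource.hive_sync.partition_value_extractor",
            "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");
    spark.stop();
  }
}
{code}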
[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type
[ https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-992: Fix Version/s: 0.14.1 (was: 0.14.0) > For hive-style partitioned source data, partition columns synced with Hive > will always have String type > --- > > Key: HUDI-992 > URL: https://issues.apache.org/jira/browse/HUDI-992 > Project: Apache Hudi > Issue Type: Bug > Components: bootstrap, meta-sync >Affects Versions: 0.9.0 >Reporter: Udit Mehrotra >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.1 > > > Currently the bootstrap implementation is not able to handle partition columns > correctly when the source data has *hive-style partitioning*, as is also > mentioned in https://jira.apache.org/jira/browse/HUDI-915 > The schema inferred while performing bootstrap and stored in the commit > metadata does not have the partition column schema (in case of hive-partitioned > data). As a result, during hive-sync, when hudi tries to determine the type of > the partition column from that schema, it does not find it and assumes the > default data type *string*. > Here is where the partition column schema is determined for hive-sync: > [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417] > > Thus no matter what the data type of the partition column is in the source data > (at least what Spark infers it as from the path), it will always be synced as > string. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6136) Add idempotency support to spark datasource writes
[ https://issues.apache.org/jira/browse/HUDI-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6136: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add idempotency support to spark datasource writes > -- > > Key: HUDI-6136 > URL: https://issues.apache.org/jira/browse/HUDI-6136 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > It would be good to add idempotency support to spark datasource writes. > Essentially, if the same batch is ingested into hudi again, hudi should skip it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
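A toy sketch of the idea, assuming batch ids are tracked somewhere durable; in a real implementation the set of applied batch ids would live in Hudi commit metadata, not in memory as here.
{code:java}
import java.util.HashSet;
import java.util.Set;

public class IdempotentWriteSketch {
  // Hypothetical: stands in for batch ids recorded in commit metadata.
  private final Set<String> appliedBatchIds = new HashSet<>();

  public boolean writeBatch(String batchId, Runnable doWrite) {
    if (appliedBatchIds.contains(batchId)) {
      return false; // same batch seen before: skip, making the write idempotent
    }
    doWrite.run();                // perform the actual write
    appliedBatchIds.add(batchId); // record the id together with the commit
    return true;
  }

  public static void main(String[] args) {
    IdempotentWriteSketch writer = new IdempotentWriteSketch();
    System.out.println(writer.writeBatch("batch-001", () -> {})); // true: written
    System.out.println(writer.writeBatch("batch-001", () -> {})); // false: skipped
  }
}
{code}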
[jira] [Updated] (HUDI-1280) Add tool to capture earliest or latest offsets in kafka topics
[ https://issues.apache.org/jira/browse/HUDI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1280: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add tool to capture earliest or latest offsets in kafka topics > --- > > Key: HUDI-1280 > URL: https://issues.apache.org/jira/browse/HUDI-1280 > Project: Apache Hudi > Issue Type: New Feature > Components: deltastreamer >Reporter: Balaji Varadarajan >Assignee: Trevorzhang >Priority: Major > Fix For: 0.14.1 > > > For bootstrapping cases using spark.write(), we need to capture offsets from > kafka topic and use it as checkpoint for subsequent read from Kafka topics. > > [https://github.com/apache/hudi/issues/1985] > We need to build this integration for smooth transition to deltastreamer. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2516) Upgrade to Junit 5.8.2
[ https://issues.apache.org/jira/browse/HUDI-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2516: - Fix Version/s: 0.14.1 (was: 0.14.0) > Upgrade to Junit 5.8.2 > -- > > Key: HUDI-2516 > URL: https://issues.apache.org/jira/browse/HUDI-2516 > Project: Apache Hudi > Issue Type: Task > Components: Testing, tests-ci >Reporter: Raymond Xu >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5018) Make user-provided copyOnWriteRecordSizeEstimate first precedence
[ https://issues.apache.org/jira/browse/HUDI-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5018: - Fix Version/s: 0.14.1 (was: 0.14.0) > Make user-provided copyOnWriteRecordSizeEstimate first precedence > - > > Key: HUDI-5018 > URL: https://issues.apache.org/jira/browse/HUDI-5018 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Raymond Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > For the estimated avg record size > https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate > which is used here > https://github.com/apache/hudi/blob/86a1efbff1300603a8180111eae117c7f9dbd8a5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L372 > Propose to respect the user setting by following the precedence below: > 1) if the user sets a value, use it as is; > 2) if the user does not set it, infer it from timeline commit metadata; > 3) if the timeline is empty, use a default (current: 1024). -- This message was sent by Atlassian Jira (v8.20.10#820010)
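The proposed precedence as plain logic (a sketch with illustrative names, not Hudi's actual code):
{code:java}
public class RecordSizeEstimateSketch {
  // 1) explicit user setting wins, 2) else infer from commit metadata,
  // 3) else fall back to the current default of 1024 bytes.
  static long resolveRecordSizeEstimate(Long userConfigured, Long inferredFromTimeline) {
    if (userConfigured != null) {
      return userConfigured;       // 1) user-provided value, used as is
    }
    if (inferredFromTimeline != null) {
      return inferredFromTimeline; // 2) inferred from timeline commit metadata
    }
    return 1024L;                  // 3) default when the timeline is empty
  }

  public static void main(String[] args) {
    System.out.println(resolveRecordSizeEstimate(2048L, 512L)); // 2048
    System.out.println(resolveRecordSizeEstimate(null, 512L));  // 512
    System.out.println(resolveRecordSizeEstimate(null, null));  // 1024
  }
}
{code}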
[jira] [Updated] (HUDI-1549) Programmatic way to fetch earliest commit retained
[ https://issues.apache.org/jira/browse/HUDI-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1549: - Fix Version/s: 0.14.1 (was: 0.14.0) > Programmatic way to fetch earliest commit retained > --- > > Key: HUDI-1549 > URL: https://issues.apache.org/jira/browse/HUDI-1549 > Project: Apache Hudi > Issue Type: New Feature > Components: cleaning, timeline-server >Affects Versions: 0.9.0 >Reporter: sivabalan narayanan >Assignee: Pratyaksh Sharma >Priority: Major > Labels: query-eng, sev:normal, user-support-issues > Fix For: 0.14.1 > > > For GDPR deletions, it would be nice if customers could programmatically find out > what the earliest commit retained is. > More context: https://github.com/apache/hudi/issues/2135 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1936) Introduce an optional property for conditional upsert
[ https://issues.apache.org/jira/browse/HUDI-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1936: - Fix Version/s: 0.14.1 (was: 0.14.0) > Introduce an optional property for conditional upsert > - > > Key: HUDI-1936 > URL: https://issues.apache.org/jira/browse/HUDI-1936 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core, writer-core >Reporter: Biswajit mohapatra >Assignee: Biswajit mohapatra >Priority: Major > Labels: features, pull-request-available, sev:high > Fix For: 0.14.1 > > > If anyone wants to use custom upsert logic, they have to override the > latest Avro payload class, which is only possible in Java or Scala. > Python developers have no such option. > This will introduce a new payload class and a new key that work in Java, > Scala, and Python. The class will be responsible for the custom upsert logic, and > the new key hoodie.update.keys will accept the columns that need to be updated: > > "hoodie.update.keys": "admission_date,name", # comma-separated keys > "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert" > # custom upsert payload class > > So this will only update the columns admission_date and name in the target > table. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
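A sketch of what a write using the proposed keys could look like; note that hoodie.update.keys and the payload class com.hudiUpsert.hudiCustomUpsert are the ticket's proposal, not existing Hudi configs, and the paths are placeholders.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ConditionalUpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("conditional-upsert-sketch").master("local[2]").getOrCreate();
    Dataset<Row> updates = spark.read().format("hudi").load("/tmp/hudi/staging");

    updates.write().format("hudi")
        .option("hoodie.table.name", "students")
        // Proposed config: only these columns get updated on the target table.
        .option("hoodie.update.keys", "admission_date,name")
        // Proposed custom payload class carrying the conditional-upsert logic.
        .option("hoodie.datasource.write.payload.class",
            "com.hudiUpsert.hudiCustomUpsert")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/students");
    spark.stop();
  }
}
{code}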
[jira] [Updated] (HUDI-3582) Introduce Secondary Index to Improve HUDI Query Performance
[ https://issues.apache.org/jira/browse/HUDI-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3582: - Fix Version/s: 0.14.1 (was: 0.14.0) > Introduce Secondary Index to Improve HUDI Query Performance > --- > > Key: HUDI-3582 > URL: https://issues.apache.org/jira/browse/HUDI-3582 > Project: Apache Hudi > Issue Type: New Feature > Components: index >Reporter: shibei >Assignee: shibei >Priority: Blocker > Fix For: 0.14.1 > > > In query processing, we need to scan many data blocks in a HUDI table. However, > most of them may not match the query predicate even after using column > statistics info in the metadata table, row-group-level or page-level > statistics in parquet files, etc. > The total data size of the touched blocks determines the query speed, and saving > IO has become the key to improving query performance. To address > this problem, we introduce a secondary index to improve hudi query performance. > In the initial implementation, a secondary index based on Lucene will be > built. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3560) Add docker image for spark3 hadoop3 and hive3
[ https://issues.apache.org/jira/browse/HUDI-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3560: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add docker image for spark3 hadoop3 and hive3 > - > > Key: HUDI-3560 > URL: https://issues.apache.org/jira/browse/HUDI-3560 > Project: Apache Hudi > Issue Type: Task >Reporter: sivabalan narayanan >Assignee: Rahil Chertara >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1588) Support multiple ordering fields via payload class config
[ https://issues.apache.org/jira/browse/HUDI-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1588: - Fix Version/s: 0.14.1 (was: 0.14.0) > Support multiple ordering fields via payload class config > - > > Key: HUDI-1588 > URL: https://issues.apache.org/jira/browse/HUDI-1588 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Affects Versions: 0.9.0 >Reporter: Raymond Xu >Assignee: Pratyaksh Sharma >Priority: Major > Fix For: 0.14.1 > > > To make configuration simpler, we want to deprecate --source-ordering-field > config and combine it with payload class config so that users can plug in > custom payload class that handles record ordering. Also the logic can be > extended to look up for multiple fields for ordering. > > Discussion thread > https://lists.apache.org/thread.html/r884e47f21ce1f3f09ef12722fa05ba9900ba12429935c10167b9fce6%40%3Cdev.hudi.apache.org%3E -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options
[ https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6122: - Fix Version/s: 0.14.1 (was: 0.14.0) > Call clean/compaction support custom options > > > Key: HUDI-6122 > URL: https://issues.apache.org/jira/browse/HUDI-6122 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: zouxxyy >Assignee: zouxxyy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
[ https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5951: - Fix Version/s: 0.14.1 (was: 0.14.0) > Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource > --- > > Key: HUDI-5951 > URL: https://issues.apache.org/jira/browse/HUDI-5951 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > HUDI-372 adds support for the short name "hudi" in Spark Datasource read and > write (df.write.format("hudi"), df.read.format("hudi")). All places should > use "hudi" with format() now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
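For illustration, the short-name usage in the Spark Java API (the path is a placeholder):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ShortNameFormatSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("short-name-sketch").master("local[2]").getOrCreate();

    // Preferred: the short name registered for the Hudi data source...
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi/my_table");

    // ...instead of the fully qualified name the ticket wants replaced:
    // spark.read().format("org.apache.hudi").load("/tmp/hudi/my_table");

    df.write().format("hudi")
        .option("hoodie.table.name", "my_table_copy")
        .mode(SaveMode.Overwrite)
        .save("/tmp/hudi/my_table_copy");
    spark.stop();
  }
}
{code}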
[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails
[ https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4626: - Fix Version/s: 0.14.1 (was: 0.14.0) > Partitioning table by `_hoodie_partition_path` fails > > > Key: HUDI-4626 > URL: https://issues.apache.org/jira/browse/HUDI-4626 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.14.1 > > > > Currently, creating a table partitioned by "_hoodie_partition_path" using > Glue catalog fails w/ the following exception: > {code:java} > AnalysisException: Found duplicate column(s) in the data schema and the > partition schema: _hoodie_partition_path > {code} > Using following DDL: > {code:java} > CREATE EXTERNAL TABLE `active_storage_attachments`( `_hoodie_commit_time` > string COMMENT '', `_hoodie_commit_seqno` string COMMENT '', > `_hoodie_record_key` string COMMENT '', `_hoodie_file_name` string COMMENT > '', `_change_operation_type` string COMMENT '', > `_upstream_event_processed_ts_ms` bigint COMMENT '', > `db_shard_source_partition` string COMMENT '', `_event_origin_ts_ms` bigint > COMMENT '', `_event_tx_id` bigint COMMENT '', `_event_lsn` bigint COMMENT > '', `_event_xmin` bigint COMMENT '', `id` bigint COMMENT '', `name` > string COMMENT '', `record_type` string COMMENT '', `record_id` bigint > COMMENT '', `blob_id` bigint COMMENT '', `created_at` timestamp COMMENT > '')PARTITIONED BY ( `_hoodie_partition_path` string COMMENT '')ROW FORMAT > SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH > SERDEPROPERTIES ( 'hoodie.query.as.ro.table'='false', 'path'='...') > STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'LOCATION > '...' > TBLPROPERTIES ( 'spark.sql.sources.provider'='hudi' ) > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4674) change the default value of inputFormat for the MOR table
[ https://issues.apache.org/jira/browse/HUDI-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4674: - Fix Version/s: 0.14.1 (was: 0.14.0) > change the default value of inputFormat for the MOR table > - > > Key: HUDI-4674 > URL: https://issues.apache.org/jira/browse/HUDI-4674 > Project: Apache Hudi > Issue Type: Improvement >Reporter: linfey.nie >Assignee: linfey.nie >Priority: Major > Labels: hudi-on-call, pull-request-available > Fix For: 0.14.1 > > > When we build a MOR table, for example with Spark SQL, the default value of > inputFormat is HoodieParquetRealtimeInputFormat. But when Hive sync is used > and the _ro suffix is skipped for reads, the inputFormat of the table with the > original name should be HoodieParquetInputFormat, which it currently is not. I > think we should change the default value of inputFormat, just like for the COW > table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2740) Support for snapshot querying on MOR table
[ https://issues.apache.org/jira/browse/HUDI-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2740: - Fix Version/s: 0.14.1 (was: 0.14.0) > Support for snapshot querying on MOR table > -- > > Key: HUDI-2740 > URL: https://issues.apache.org/jira/browse/HUDI-2740 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4324) Remove useJdbc config from meta sync tools
[ https://issues.apache.org/jira/browse/HUDI-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4324: - Fix Version/s: 0.14.1 (was: 0.14.0) > Remove useJdbc config from meta sync tools > -- > > Key: HUDI-4324 > URL: https://issues.apache.org/jira/browse/HUDI-4324 > Project: Apache Hudi > Issue Type: Improvement > Components: meta-sync >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1872) Move HoodieFlinkStreamer into hudi-utilities module
[ https://issues.apache.org/jira/browse/HUDI-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1872: - Fix Version/s: 0.14.1 (was: 0.14.0) > Move HoodieFlinkStreamer into hudi-utilities module > --- > > Key: HUDI-1872 > URL: https://issues.apache.org/jira/browse/HUDI-1872 > Project: Apache Hudi > Issue Type: Task > Components: flink >Reporter: Danny Chen >Assignee: Vinay >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4613) Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
[ https://issues.apache.org/jira/browse/HUDI-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4613: - Fix Version/s: 0.14.1 (was: 0.14.0) > Avoid the use of regex expressions when call hoodieFileGroup#addLogFile > function > > > Key: HUDI-4613 > URL: https://issues.apache.org/jira/browse/HUDI-4613 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: lei w >Assignee: lei w >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: withChange.png, withoutChange.png > > > When the number of log files exceeds a certain threshold, the > construction of the fsview becomes very time-consuming. The reason is that > the LogFileComparator#compare method is called frequently when constructing a > file group, and regular expressions are used in this method. > {panel:title=build FileSystemView Log } > INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=60801, > NumFileGroups=200, FileGroupsCreationTime=34036, StoreTimeTaken=2 > {panel} -- This message was sent by Atlassian Jira (v8.20.10#820010)
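A simplified sketch of the optimization direction, assuming a made-up log-file naming pattern (this is not Hudi's actual HoodieLogFile or comparator): parse the name once at construction so that compares read a cached field instead of re-running the regex.
{code:java}
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFileOrderingSketch {
  // Illustrative naming scheme: ".<fileId>_<baseTime>.log.<version>"
  private static final Pattern LOG_FILE = Pattern.compile("\\.(.+)_(.+)\\.log\\.(\\d+)");

  static final class LogFile {
    final String name;
    final int version; // parsed once here, not on every compare

    LogFile(String name) {
      Matcher m = LOG_FILE.matcher(name);
      if (!m.matches()) throw new IllegalArgumentException(name);
      this.name = name;
      this.version = Integer.parseInt(m.group(3));
    }
  }

  // The comparator reads a cached int; this matters when a file-system view
  // compares log files millions of times while building file groups.
  static final Comparator<LogFile> BY_VERSION = Comparator.comparingInt(f -> f.version);

  public static void main(String[] args) {
    LogFile a = new LogFile(".b1_20230101.log.3");
    LogFile b = new LogFile(".b1_20230101.log.10");
    System.out.println(BY_VERSION.compare(a, b) < 0); // true: 3 before 10
  }
}
{code}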
[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer
[ https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5575: - Fix Version/s: 0.14.1 (was: 0.14.0) > Support any record key generation along w/ any partition path generation for > row writer > --- > > Key: HUDI-5575 > URL: https://issues.apache.org/jira/browse/HUDI-5575 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Lokesh Jain >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > HUDI-5535 adds support for record key generation along w/ any partition path > generation. It also separates the record key generation and partition path > generation into separate interfaces. > This jira aims to add similar support for the row writer path in spark. > cc [~shivnarayan] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2034) Support explicit partition compaction strategy for flink batch compaction
[ https://issues.apache.org/jira/browse/HUDI-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2034: - Fix Version/s: 0.14.1 (was: 0.14.0) > Support explicit partition compaction strategy for flink batch compaction > -- > > Key: HUDI-2034 > URL: https://issues.apache.org/jira/browse/HUDI-2034 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, flink >Reporter: Zheng yunhong >Assignee: Zheng yunhong >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Support explicit partition compaction strategy for flink batch compaction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4990) Parallelize deduplication in CLI tool
[ https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4990: - Fix Version/s: 0.14.1 (was: 0.14.0) > Parallelize deduplication in CLI tool > - > > Key: HUDI-4990 > URL: https://issues.apache.org/jira/browse/HUDI-4990 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: sivabalan narayanan >Priority: Minor > Fix For: 0.14.1 > > > The CLI tool command `repair deduplicate` repairs one partition at a time, so > repairing hundreds of partitions takes time. We should add a mode that takes > multiple partition paths in the CLI and runs the dedup job for multiple > partitions at the same time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4106) Identify out of the box default performance flips for spark-sql
[ https://issues.apache.org/jira/browse/HUDI-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4106: - Fix Version/s: 0.14.1 (was: 0.14.0) > Identify out of the box default performance flips for spark-sql > --- > > Key: HUDI-4106 > URL: https://issues.apache.org/jira/browse/HUDI-4106 > Project: Apache Hudi > Issue Type: Improvement >Reporter: sivabalan narayanan >Priority: Major > Fix For: 0.14.1 > > > We had HUDI-2151 to track performance flips, but it has been a year since we > combed through all configs. Let's do another round of combing through all > configs and come up with a new list to flip. > This ticket specifically tracks spark-sql layer configs. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3954) Don't keep the last commit before the earliest commit to retain
[ https://issues.apache.org/jira/browse/HUDI-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3954: - Fix Version/s: 0.14.1 (was: 0.14.0) > Don't keep the last commit before the earliest commit to retain > --- > > Key: HUDI-3954 > URL: https://issues.apache.org/jira/browse/HUDI-3954 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning >Reporter: 董可伦 >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Don't keep the last commit before the earliest commit to retain. > According to the documentation of {{{}hoodie.cleaner.commits.retained{}}}: > Number of commits to retain, without cleaning. This will be retained for > num_of_commits * time_between_commits (scheduled). This also directly > translates into how much data retention the table supports for incremental > queries. > > We only need to keep the number of commits configured through the parameter > {{{}hoodie.cleaner.commits.retained{}}}, > and the commits retained by clean are completed. This ensures that "This will be > retained for num_of_commits * time_between_commits" holds as documented. > So we don't need to keep the last commit before the earliest commit to > retain; if we want to keep more versions, we can increase the parameter > {{hoodie.cleaner.commits.retained}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
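A toy sketch of the retention rule the ticket argues for (illustrative only, not the cleaner's actual code): with N commits retained, keep exactly the last N completed commits and clean everything older.
{code:java}
import java.util.Arrays;
import java.util.List;

public class CleanerRetentionSketch {
  // hoodie.cleaner.commits.retained = commitsRetained
  static List<String> retained(List<String> completedCommits, int commitsRetained) {
    int from = Math.max(0, completedCommits.size() - commitsRetained);
    return completedCommits.subList(from, completedCommits.size());
  }

  public static void main(String[] args) {
    List<String> commits = Arrays.asList("c1", "c2", "c3", "c4", "c5");
    // Proposed: [c3, c4, c5]; the behavior being questioned would also
    // keep c2, the last commit before the earliest commit to retain.
    System.out.println(retained(commits, 3));
  }
}
{code}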
[jira] [Updated] (HUDI-6464) Implement Spark SQL Merge Into for tables without primary key
[ https://issues.apache.org/jira/browse/HUDI-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6464: - Fix Version/s: 0.14.1 (was: 0.14.0) > Implement Spark SQL Merge Into for tables without primary key > - > > Key: HUDI-6464 > URL: https://issues.apache.org/jira/browse/HUDI-6464 > Project: Apache Hudi > Issue Type: New Feature > Components: spark-sql >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Merge Into currently only matches on the primary key which pkless tables > don't have -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5101) Adding spark structured streaming tests to integ tests
[ https://issues.apache.org/jira/browse/HUDI-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5101: - Fix Version/s: 0.14.1 (was: 0.14.0) > Adding spark structured streaming tests to integ tests > -- > > Key: HUDI-5101 > URL: https://issues.apache.org/jira/browse/HUDI-5101 > Project: Apache Hudi > Issue Type: Improvement > Components: tests-ci >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6552) Restructure FAQ on the website to categorize tuning and troubleshooting related questions separately.
[ https://issues.apache.org/jira/browse/HUDI-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6552: - Fix Version/s: 0.14.1 (was: 0.14.0) > Restructure FAQ on the website to categorize tuning and troubleshooting > related questions separately. > -- > > Key: HUDI-6552 > URL: https://issues.apache.org/jira/browse/HUDI-6552 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3291) Flip Default record payload to DefaultHoodieRecordPayload
[ https://issues.apache.org/jira/browse/HUDI-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3291: - Fix Version/s: 0.14.1 (was: 0.14.0) > Flip Default record payload to DefaultHoodieRecordPayload > > > Key: HUDI-3291 > URL: https://issues.apache.org/jira/browse/HUDI-3291 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > Original Estimate: 3h > Remaining Estimate: 3h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-304) Bring back spotless plugin
[ https://issues.apache.org/jira/browse/HUDI-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-304: Fix Version/s: 0.14.1 (was: 0.14.0) > Bring back spotless plugin > --- > > Key: HUDI-304 > URL: https://issues.apache.org/jira/browse/HUDI-304 > Project: Apache Hudi > Issue Type: Task > Components: code-quality, dev-experience, Testing >Reporter: Balaji Varadarajan >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > Time Spent: 10m > Remaining Estimate: 0h > > The spotless plugin has been turned off as the eclipse style format it was > referencing was removed due to compliance reasons. > We use the google style eclipse format with some changes: one setting at line 90 > of the style sheet was changed, and at line 242 a value was changed from > "100" to "120". > > The eclipse style sheet was originally obtained from > [https://github.com/google/styleguide], whose CC-BY 3.0 license is not > compatible with source distribution (See > [https://www.apache.org/legal/resolved.html#cc-by]) > > We need to figure out a way to bring this back > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4345) Incorporate partition pruning for COW incremental query
[ https://issues.apache.org/jira/browse/HUDI-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4345: - Fix Version/s: 0.14.1 (was: 0.14.0) > Incorporate partition pruning for COW incremental query > --- > > Key: HUDI-4345 > URL: https://issues.apache.org/jira/browse/HUDI-4345 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2296) flink support ConsistencyGuard plugin
[ https://issues.apache.org/jira/browse/HUDI-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2296: - Fix Version/s: 0.14.1 (was: 0.14.0) > flink support ConsistencyGuard plugin > --- > > Key: HUDI-2296 > URL: https://issues.apache.org/jira/browse/HUDI-2296 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Gengxuhan >Assignee: Gengxuhan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Flink should support the ConsistencyGuard plugin. When there is metadata latency in the > file system, a CKP is submitted once and then a new instant is started; because the commit > cannot be seen, the data will be rolled back. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
[ https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4878: - Fix Version/s: 0.14.1 (was: 0.14.0) > Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS > > > Key: HUDI-4878 > URL: https://issues.apache.org/jira/browse/HUDI-4878 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning >Reporter: sivabalan narayanan >Assignee: nicolas paris >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Clean based on LATEST_FILE_VERSIONS can be improved further, since incremental > clean is not enabled for it. Let's see if we can improve this. > > Context from the author: > > > Currently incremental cleaning is run for both the KEEP_LATEST_COMMITS and > KEEP_LATEST_BY_HOURS > policies. It is not run for KEEP_LATEST_FILE_VERSIONS. > This can lead to files not being cleaned. This PR fixes the problem by enabling > incremental cleaning for KEEP_LATEST_FILE_VERSIONS only. > Here is the scenario of the problem: > Say we have 3 committed files in partition-A, we add a new commit in > partition-B, and we trigger cleaning for the first time (full partition scan): > {{partition-A/ > commit-0.parquet > commit-1.parquet > commit-2.parquet > partition-B/ > commit-3.parquet}} > In the case where we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, > the cleaner will remove commit-0.parquet to keep 3 commits. > For the next cleaning, incremental cleaning will trigger and won't consider > partition-A/ until a new commit changes it. In case no later commit changes > partition-A, commit-1.parquet will stay forever, even though it should be > removed by the cleaner. > Now in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep > commit-2.parquet. Then it makes sense that incremental cleaning won't > consider partition-A until it is changed, because there is only one commit. > This is why incremental cleaning should only be enabled with > KEEP_LATEST_FILE_VERSIONS. > Hope this is clear enough. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4688) Decouple lazy rollback of failed writes from clean action in multi-writer
[ https://issues.apache.org/jira/browse/HUDI-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4688: - Fix Version/s: 0.14.1 (was: 0.14.0) > Decouple lazy rollback of failed writes from clean action in multi-writer > - > > Key: HUDI-4688 > URL: https://issues.apache.org/jira/browse/HUDI-4688 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Critical > Fix For: 0.14.1 > > > What if someone disables cleaning or runs it only once every 50 commits? The > lazy rollback then won't happen for up to 50 commits. So, decouple lazy rollbacks > of failed writes from the cleaner in multi-writer mode. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6115) Harden expected corrupt record column in chained transformer when error table settings are on/off
[ https://issues.apache.org/jira/browse/HUDI-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6115: - Fix Version/s: 0.14.1 (was: 0.14.0) > Harden expected corrupt record column in chained transformer when error table > settings are on/off > -- > > Key: HUDI-6115 > URL: https://issues.apache.org/jira/browse/HUDI-6115 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Harshal Patil >Assignee: Harshal Patil >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.1 > > > When the error table is enabled and a transformer drops the existing > corruptRecordColumn, quarantine records can get dropped. > This PR aims at hardening the expectation of the corruptRecordColumn in the output > schemas of transformers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6149) Add a tool to fetch table size for hudi tables
[ https://issues.apache.org/jira/browse/HUDI-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6149: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add a tool to fetch table size for hudi tables > -- > > Key: HUDI-6149 > URL: https://issues.apache.org/jira/browse/HUDI-6149 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3819) upgrade spring cve-2022-22965
[ https://issues.apache.org/jira/browse/HUDI-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3819: - Fix Version/s: 0.14.1 (was: 0.14.0) > upgrade spring cve-2022-22965 > - > > Key: HUDI-3819 > URL: https://issues.apache.org/jira/browse/HUDI-3819 > Project: Apache Hudi > Issue Type: Bug > Components: cli >Affects Versions: 0.9.0, 0.10.1 >Reporter: Jason-Morries Adam >Assignee: Sagar Sumit >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > We should upgrade the Spring Framework version at Hudi CLI because of > cve-2022-22965. The Qualys Scanner finds these packages and raises a warning > because of the existence of these files on the system. > The found files are: > /usr/lib/hudi/cli/lib/spring-beans-4.2.4.RELEASE.jar > /usr/lib/hudi/cli/lib/spring-core-4.2.4.RELEASE.jar > More Information: > Spring Framework: https://spring.io/projects/spring-framework > Spring project spring-framework release notes: > https://github.com/spring-projects/spring-framework/releases > CVE-2022-22965: https://tanzu.vmware.com/security/cve-2022-22965 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3742) Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression
[ https://issues.apache.org/jira/browse/HUDI-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3742: - Fix Version/s: 0.14.1 (was: 0.14.0) > Enable parquet enableVectorizedReader for spark incremental read to prevent > perf regression > --- > > Key: HUDI-3742 > URL: https://issues.apache.org/jira/browse/HUDI-3742 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Tao Meng >Assignee: Tao Meng >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Currently we disable the parquet enableVectorizedReader for MOR incremental reads > and set "spark.sql.parquet.recordLevelFilter.enabled" = "true" to achieve > data filtering, > which is slow. -- This message was sent by Atlassian Jira (v8.20.10#820010)
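For illustration, the two Spark session settings the ticket contrasts, shown on a local session (spark.sql.parquet.enableVectorizedReader is standard Spark; the proposed direction is to keep it enabled):
{code:java}
import org.apache.spark.sql.SparkSession;

public class VectorizedReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("vectorized-read-sketch").master("local[2]")
        // Current behavior per the ticket: vectorization off,
        // record-level filtering on, which is slow.
        .config("spark.sql.parquet.enableVectorizedReader", "false")
        .config("spark.sql.parquet.recordLevelFilter.enabled", "true")
        .getOrCreate();

    // Proposed direction: keep the vectorized reader enabled instead.
    spark.conf().set("spark.sql.parquet.enableVectorizedReader", "true");
    System.out.println(spark.conf().get("spark.sql.parquet.enableVectorizedReader"));
    spark.stop();
  }
}
{code}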
[jira] [Updated] (HUDI-5498) Update docs for reading Hudi tables on Databricks runtime
[ https://issues.apache.org/jira/browse/HUDI-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5498: - Fix Version/s: 0.14.1 (was: 0.14.0) > Update docs for reading Hudi tables on Databricks runtime > - > > Key: HUDI-5498 > URL: https://issues.apache.org/jira/browse/HUDI-5498 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.14.1 > > > We need to document how users can read Hudi tables on Databricks Spark > runtime. > Relevant fix: [https://github.com/apache/hudi/pull/7088] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x
[ https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2955: - Fix Version/s: 0.14.1 (was: 0.14.0) > Upgrade Hadoop to 3.3.x > --- > > Key: HUDI-2955 > URL: https://issues.apache.org/jira/browse/HUDI-2955 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Alexey Kudinkin >Assignee: Rahil Chertara >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png > > > According to the Hadoop compatibility matrix, this is a prerequisite to > upgrading to JDK 11: > !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230! > [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions] > > *Upgrading Hadoop from 2.x to 3.x* > [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts] > Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6281) Comprehensive schema evolution supports column change with a default value
[ https://issues.apache.org/jira/browse/HUDI-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6281: - Fix Version/s: 0.14.1 (was: 0.14.0) > Comprehensive schema evolution supports column change with a default value > -- > > Key: HUDI-6281 > URL: https://issues.apache.org/jira/browse/HUDI-6281 > Project: Apache Hudi > Issue Type: New Feature > Components: core >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Comprehensive schema evolution should support column changes with a default > value, e.g., adding a column with a default value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5075) Add support to rollback residual clustering after disabling clustering
[ https://issues.apache.org/jira/browse/HUDI-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5075: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add support to rollback residual clustering after disabling clustering > -- > > Key: HUDI-5075 > URL: https://issues.apache.org/jira/browse/HUDI-5075 > Project: Apache Hudi > Issue Type: Bug > Components: clustering >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.1 > > > If a user enabled clustering and after some time disabled it for whatever > reason, there is a chance that a pending clustering is left in the timeline. > Once clustering is disabled, this could just be lying around, but it could > affect metadata table compaction, which in turn might affect data table > archival. > So, we need a way to fix this. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6121) Log the exception in the hudi commit kafka callback
[ https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-6121: - Fix Version/s: 0.14.1 (was: 0.14.0) > Log the exception in the hudi commit kafka callback > --- > > Key: HUDI-6121 > URL: https://issues.apache.org/jira/browse/HUDI-6121 > Project: Apache Hudi > Issue Type: Improvement >Reporter: ziqiao >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.1 > > > Right now the Kafka callback does not log the exception, and without a log it > is hard to find out why sending the Kafka message failed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure
[ https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3636: - Fix Version/s: 0.14.1 (was: 0.14.0) > Clustering fails due to marker creation failure > --- > > Key: HUDI-3636 > URL: https://issues.apache.org/jira/browse/HUDI-3636 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Scenario: multi-writer test, one writer doing ingesting with Deltastreamer > continuous mode, COW, inserts, async clustering and cleaning (partitions > under 2022/1, 2022/2), another writer with Spark datasource doing backfills > to different partitions (2021/12). > 0.10.0 no MT, clustering instant is inflight (failing it in the middle before > upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before. > The clustering/replace instant cannot make progress due to marker creation > failure, failing the DS ingestion as well. Need to investigate if this is > timeline-server-based marker related or MT related. > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 > (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: > org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file > 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE > Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] > failed: Connection refused (Connection refused) > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121) > at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) > at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) > at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) > at scala.collection.AbstractIterator.to(Iterator.scala:1431) > at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) > at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) > at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) > at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) > at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030) > at > org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file > 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE > Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] > failed: Connection refused (Connection refused) > at > org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
[jira] [Updated] (HUDI-4329) Add separate control for compaction operation sync/async mode
[ https://issues.apache.org/jira/browse/HUDI-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4329: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add separate control for compaction operation sync/async mode > - > > Key: HUDI-4329 > URL: https://issues.apache.org/jira/browse/HUDI-4329 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Shizhi Chen >Assignee: Shizhi Chen >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > *Problem Review* > The sync/async mode of the compact operation in CompactFunction is currently controlled by > FlinkOptions#COMPACTION_ASYNC_ENABLED > {code:java} > public CompactFunction(Configuration conf) { > this.conf = conf; > this.asyncCompaction = StreamerUtil.needsAsyncCompaction(conf); > } > {code} > In fact, it cannot be switched to sync mode, because the pipeline built > for the non-async path only includes the clean operator, not the compact operators. > {code:java} > // compaction > if (StreamerUtil.needsAsyncCompaction(conf)) { > return Pipelines.compact(conf, pipeline); > } else { > return Pipelines.clean(conf, pipeline); > } > {code} > *Improvement* > Add a separate control switch for the compaction operation's sync/async mode, as sketched below. -- This message was sent by Atlassian Jira (v8.20.10#820010)
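One possible shape of the improvement, reusing the names from the pipeline snippet above; the option key "compaction.sync.enabled" is hypothetical (the issue does not name the new switch), so treat this as a sketch of the direction rather than the actual fix:
{code:java}
// Hypothetical separate switch; "compaction.sync.enabled" is illustrative,
// not an actual FlinkOptions entry.
boolean asyncCompaction = StreamerUtil.needsAsyncCompaction(conf);
boolean syncCompaction = conf.getBoolean("compaction.sync.enabled", false);

if (asyncCompaction || syncCompaction) {
  // Wire the compact operators in both modes; CompactFunction can then honor
  // the sync/async flag independently of the pipeline shape.
  return Pipelines.compact(conf, pipeline);
} else {
  return Pipelines.clean(conf, pipeline);
}
{code}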
[jira] [Updated] (HUDI-2733) Adding Thrift support in HiveSyncTool
[ https://issues.apache.org/jira/browse/HUDI-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-2733: - Fix Version/s: 0.14.1 (was: 0.14.0) > Adding Thrift support in HiveSyncTool > - > > Key: HUDI-2733 > URL: https://issues.apache.org/jira/browse/HUDI-2733 > Project: Apache Hudi > Issue Type: New Feature > Components: meta-sync, Utilities >Reporter: Satyam Raj >Assignee: Satyam Raj >Priority: Major > Labels: hive-sync, pull-request-available > Fix For: 0.14.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Introduce a Thrift metastore client in HiveSyncTool to sync Hudi tables to the Hive warehouse. > Suggested client to integrate with: > https://github.com/akolb1/hclient/tree/master/tools-common -- This message was sent by Atlassian Jira (v8.20.10#820010)
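For context, a minimal sketch of talking to the metastore over Thrift with the stock Hive client; the metastore URI and table names are placeholders, and this does not reflect the actual HiveSyncTool integration:
{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class ThriftSyncSketch {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Placeholder metastore endpoint; the client speaks Thrift directly,
    // avoiding the JDBC/HiveServer2 path.
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    Table table = client.getTable("hudi", "test_hudi_target"); // read current metadata before syncing
    System.out.println(table.getSd().getLocation());
    client.close();
  }
}
{code}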
[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata
[ https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5352: - Fix Version/s: 0.14.1 (was: 0.14.0) > Jackson fails to serialize LocalDate when updating Delta Commit metadata > > > Key: HUDI-5352 > URL: https://issues.apache.org/jira/browse/HUDI-5352 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Alexey Kudinkin >Assignee: Raymond Xu >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests because > Jackson cannot serialize LocalDate as is and requires the > additional JSR-310 datatype module (jackson-datatype-jsr310). > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
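A minimal standalone sketch of the failure and the fix direction; it assumes nothing about where Hudi configures its ObjectMapper, only that jackson-datatype-jsr310 is on the classpath so JavaTimeModule can be registered:
{code:java}
import java.time.LocalDate;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;

public class LocalDateJsonDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper plain = new ObjectMapper();
    try {
      // Fails with InvalidDefinitionException: Java 8 date/time type not supported by default.
      plain.writeValueAsString(LocalDate.of(2022, 12, 1));
    } catch (Exception e) {
      System.out.println(e.getMessage());
    }

    // Registering the JSR-310 module makes LocalDate serializable.
    ObjectMapper withJsr310 = new ObjectMapper().registerModule(new JavaTimeModule());
    System.out.println(withJsr310.writeValueAsString(LocalDate.of(2022, 12, 1))); // [2022,12,1]
  }
}
{code}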
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-5238: - Fix Version/s: 0.14.1 (was: 0.14.0) > Hudi throwing "PipeBroken" exception during Merging on GCS > -- > > Key: HUDI-5238 > URL: https://issues.apache.org/jira/browse/HUDI-5238 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.1 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.1 > > > Originally reported at [https://github.com/apache/hudi/issues/7234] > --- > > Root-cause: > Basically, the reason it's failing is the following: # GCS uses > PipedInputStream/PipedOutputStream as the reading/writing ends of the > “pipe” it uses for unidirectional communication between threads > # PipedInputStream remembers the thread that actually > wrote into the pipe > # In BoundedInMemoryQueue we're bootstrapping new executors (read: threads) > for reading and _writing_ (it's only used in HoodieMergeHandle, and in > bulk-insert) > # When we're done writing in HoodieMergeHelper, we shut down the BoundedInMemoryQueue *first* > and the HoodieMergeHandle after it, so the handle's final writes against the pipe happen > after the pipe's writer thread has already died, and that is exactly why it fails > > The issue was introduced in [https://github.com/apache/hudi/pull/4264/files] -- This message was sent by Atlassian Jira (v8.20.10#820010)
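The java.io behavior behind step 2 can be reproduced standalone; this is plain JDK code, not the Hudi/GCS path, and only illustrates that once the thread that wrote into a pipe dies, further reads on the drained pipe fail:
{code:java}
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeBrokenDemo {
  public static void main(String[] args) throws Exception {
    PipedOutputStream out = new PipedOutputStream();
    PipedInputStream in = new PipedInputStream(out);

    Thread writer = new Thread(() -> {
      try {
        out.write(new byte[] {1, 2, 3}); // PipedInputStream records this thread as its write side
      } catch (Exception ignored) {
      }
    });
    writer.start();
    writer.join(); // the writer thread is now dead, mirroring the early queue shutdown

    in.read(new byte[3]); // buffered bytes still drain fine
    in.read();            // throws java.io.IOException ("Write end dead" / "Pipe broken")
  }
}
{code}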
[jira] [Updated] (HUDI-3695) Add orc reader in HoodieBaseRelation
[ https://issues.apache.org/jira/browse/HUDI-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-3695: - Fix Version/s: 0.14.1 (was: 0.14.0) > Add orc reader in HoodieBaseRelation > > > Key: HUDI-3695 > URL: https://issues.apache.org/jira/browse/HUDI-3695 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.11.0 >Reporter: miomiocat >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > The ORC reader is no longer supported in HoodieBaseRelation after HUDI-3338. > It should be restored so that ORC-based Hudi tables can be read; otherwise an > UnsupportedOperationException will be thrown, as in the sketch below. > > code details: > [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala#L323] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
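A minimal sketch of the read path that hits the regression; the SparkSession setup and table path are placeholders, and the table is assumed to have been written with hoodie.table.base.file.format=ORC:
{code:java}
import org.apache.spark.sql.SparkSession;

public class ReadOrcHudiTable {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-hudi-read")
        .master("local[2]")
        .getOrCreate();
    // On affected versions this lands in HoodieBaseRelation without an ORC
    // reader and fails with UnsupportedOperationException instead of scanning
    // the ORC base files.
    spark.read().format("hudi").load("/tmp/hudi/orc_table").show();
    spark.stop();
  }
}
{code}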