[jira] [Updated] (HUDI-7993) Support pruning and skipping with meta fields
[ https://issues.apache.org/jira/browse/HUDI-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7993:
    Labels: pull-request-available  (was: )

> Support pruning and skipping with meta fields
>
> Key: HUDI-7993
> URL: https://issues.apache.org/jira/browse/HUDI-7993
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Updated] (HUDI-8015) [Glue] Fix Glue Meta Sync Failure on base path change
[ https://issues.apache.org/jira/browse/HUDI-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8015:
    Labels: pull-request-available  (was: )

> [Glue] Fix Glue Meta Sync Failure on base path change
>
> Key: HUDI-8015
> URL: https://issues.apache.org/jira/browse/HUDI-8015
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Vamsi Karnika
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-8014) Fix Exception "FileID of partition path xxx=xx does not exist"
[ https://issues.apache.org/jira/browse/HUDI-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8014:
    Labels: pull-request-available  (was: )

> Fix Exception "FileID of partition path xxx=xx does not exist"
>
> Key: HUDI-8014
> URL: https://issues.apache.org/jira/browse/HUDI-8014
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Assignee: Vova Kolmakov
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.15.1
>
> [https://github.com/apache/hudi/issues/11202]
[jira] [Updated] (HUDI-8012) Update checkstyle.xml based on the new release
[ https://issues.apache.org/jira/browse/HUDI-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8012:
    Labels: pull-request-available  (was: )

> Update checkstyle.xml based on the new release
>
> Key: HUDI-8012
> URL: https://issues.apache.org/jira/browse/HUDI-8012
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: pull-request-available
>
> checkstyle.xml works with the older checkstyle release 8.20 only. We need to make it work with recent checkstyle releases in IntelliJ (e.g., 10.12.5); see the following error.
> {code:java}
> com.puppycrawl.tools.checkstyle.api.CheckstyleException: cannot initialize module TreeWalker - TreeWalker is not allowed as a parent of LineLength Please review 'Parent Module' section for this Check in web documentation if Check is standard.
>     at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:486)
>     at com.puppycrawl.tools.checkstyle.AbstractAutomaticBean.configure(AbstractAutomaticBean.java:207)
>     at org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:61)
>     at org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:26)
>     at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.executeCommand(CheckstyleActionsImpl.java:130)
>     at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:60)
>     at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:51)
>     at org.infernus.idea.checkstyle.checker.CheckerFactoryWorker.run(CheckerFactoryWorker.java:46)
> Caused by: com.puppycrawl.tools.checkstyle.api.CheckstyleException: TreeWalker is not allowed as a parent of LineLength Please review 'Parent Module' section for this Check in web documentation if Check is standard.
>     at com.puppycrawl.tools.checkstyle.TreeWalker.setupChild(TreeWalker.java:140)
>     at com.puppycrawl.tools.checkstyle.AbstractAutomaticBean.configure(AbstractAutomaticBean.java:207)
>     at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:481)
>     ... 7 more
> {code}
[jira] [Updated] (HUDI-8011) Allow schema.on.read with positional merging
[ https://issues.apache.org/jira/browse/HUDI-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8011:
    Labels: pull-request-available  (was: )

> Allow schema.on.read with positional merging
>
> Key: HUDI-8011
> URL: https://issues.apache.org/jira/browse/HUDI-8011
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, spark
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Internal schema doesn't have the positional column, so it will fail during pruning.
[jira] [Updated] (HUDI-8005) New lock provider implementation
[ https://issues.apache.org/jira/browse/HUDI-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8005:
    Labels: pull-request-available  (was: )

> New lock provider implementation
>
> Key: HUDI-8005
> URL: https://issues.apache.org/jira/browse/HUDI-8005
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Major
> Labels: pull-request-available
>
> h2. Estimated effort: 2 days
>
> *The new LP is DynamoDB-based only. Zookeeper is beyond the scope here.*
>
> As of today, an LP like the DynamoDB one generates a per-table LP attribute {{partition-key}}, which is used as the name of the lock that readers and writers grab on the DDB side. Its schema is {{<table name>-<first few chars of the table uuid>}} to ensure uniqueness and a 1-to-1 mapping between the key and the table. The table UUID is Onehouse-specific and is not accessible from the Hudi writer's context; Hudi writers only have access to HoodieWriterConfig and HoodieTableConfig. This means the partition key is unknown to Hudi writers initiated by SQL.
>
> The proposed solution is to change the schema of {{partition-key}} to {{<table name>-<hash of table base path>}}. Since the table name and the table base path can be derived from writer configs by the Hudi writer, this addresses the issue.
>
> Properties of the partition key:
> * *Uniqueness*: The lock key must be unique for each resource you want to lock. This ensures that different resources are independently locked.
> * *Meaningful Naming*: Use meaningful names for lock keys to make it clear what resource is being locked. This is particularly useful for debugging and maintenance.
> * *DynamoDB Partition Key Limits*: DynamoDB has limits on the size of partition keys. The maximum length for a partition key is 2048 bytes when using UTF-8 encoding. Ensure your lock keys do not exceed this limit.
> ** As of today, Hudi does not enforce a length limit on table names. The follow-up task is tracked here. It is beyond M1 scope.
>
> For now, if the newly generated partition key is more than 2048 bytes, *we will simply truncate the table name* so that the hash part can fit in: {{<truncated table name>-<hash of table base path>}}
>
> h3. Hash function
> We can use any mainstream non-cryptographic hash library like Murmur or FarmHash.
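For illustration, here is a minimal sketch of how such a {{<table name>-<hash of table base path>}} key could be derived. The helper names are hypothetical, MD5 is only a stand-in for the Murmur/FarmHash choice the ticket leaves open, and byte-length handling is simplified to ASCII table names; truncation keeps the hash suffix intact so uniqueness is preserved even for very long names.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LockKeySketch {
  private static final int MAX_KEY_BYTES = 2048; // DynamoDB partition key limit (UTF-8)

  // Stand-in for a non-cryptographic hash like Murmur/FarmHash mentioned in the ticket.
  static String hashOf(String basePath) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(basePath.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  // Builds "<table name>-<hash of base path>", truncating the table name so the
  // whole key fits in 2048 bytes while the hash part is always kept whole.
  static String partitionKey(String tableName, String basePath) {
    String hash = hashOf(basePath);
    int budget = MAX_KEY_BYTES - hash.length() - 1; // minus the '-' separator
    String name = tableName.length() > budget ? tableName.substring(0, budget) : tableName;
    return name + "-" + hash;
  }

  public static void main(String[] args) {
    System.out.println(partitionKey("my_table", "s3://bucket/warehouse/my_table"));
  }
}
{code}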
[jira] [Updated] (HUDI-8009) Optimize the code of HoodieTable#getPendingCommitTimeline
[ https://issues.apache.org/jira/browse/HUDI-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8009:
    Labels: pull-request-available  (was: )

> Optimize the code of HoodieTable#getPendingCommitTimeline
>
> Key: HUDI-8009
> URL: https://issues.apache.org/jira/browse/HUDI-8009
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: bradley
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-4625) Clean up KafkaOffsetGen
[ https://issues.apache.org/jira/browse/HUDI-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4625:
    Labels: pull-request-available  (was: )

> Clean up KafkaOffsetGen
>
> Key: HUDI-4625
> URL: https://issues.apache.org/jira/browse/HUDI-4625
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer
> Reporter: Alexey Kudinkin
> Assignee: Vova Kolmakov
> Priority: Major
> Labels: pull-request-available
>
> There are a few issues within KafkaOffsetGen that we should follow up on, annotated with corresponding TODOs:
> # Using a proper retrying client (instead of using sleeps for coordination)
> # Cleaning up incorrect assertions
[jira] [Updated] (HUDI-8008) Resolve Proto Schemas returned from Confluent registry
[ https://issues.apache.org/jira/browse/HUDI-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8008:
    Labels: pull-request-available  (was: )

> Resolve Proto Schemas returned from Confluent registry
>
> Key: HUDI-8008
> URL: https://issues.apache.org/jira/browse/HUDI-8008
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Timothy Brown
> Assignee: Timothy Brown
> Priority: Major
> Labels: pull-request-available
>
> The Confluent Schema Registry can return a proto schema with references to other proto schemas. These references need to be resolved. The registry SDK will handle this automatically for us, so we can update to use that instead of our own HTTP client setup.
[jira] [Updated] (HUDI-8007) ShowInvalidParquetProcedure support delete by parameter
[ https://issues.apache.org/jira/browse/HUDI-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8007:
    Labels: pull-request-available  (was: )

> ShowInvalidParquetProcedure support delete by parameter
>
> Key: HUDI-8007
> URL: https://issues.apache.org/jira/browse/HUDI-8007
> Project: Apache Hudi
> Issue Type: New Feature
> Components: cli
> Reporter: Danny Chen
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-8006) glue sync may not update columns
[ https://issues.apache.org/jira/browse/HUDI-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8006:
    Labels: pull-request-available  (was: )

> glue sync may not update columns
>
> Key: HUDI-8006
> URL: https://issues.apache.org/jira/browse/HUDI-8006
> Project: Apache Hudi
> Issue Type: Bug
> Components: meta-sync
> Reporter: nicolas paris
> Assignee: nicolas paris
> Priority: Major
> Labels: pull-request-available
>
> Because async calls may not complete in time, Glue sync can consider the table schema not up to date in a sequence such as:
> # promote a column type
> # add a table property
> Step 2 can retrieve the table schema before step 1 has landed, in which case the changes from step 1 are discarded.
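A minimal sketch of the ordering this implies, with hypothetical stand-ins for the two Glue calls (not the actual fix): chaining the futures guarantees that step 2 reads the schema only after step 1 has committed, so the type promotion cannot be overwritten by a stale snapshot.

{code:java}
import java.util.concurrent.CompletableFuture;

public class GlueSyncOrderingSketch {
  public static void main(String[] args) {
    // Step 1: promote the column type (hypothetical stand-in for the Glue call).
    CompletableFuture<Void> promoteType =
        CompletableFuture.runAsync(() -> System.out.println("promote column type"));

    // Step 2 runs only after step 1 completes, so it sees the promoted schema
    // instead of a snapshot taken before step 1 landed.
    promoteType
        .thenRun(() -> System.out.println("add table property on current schema"))
        .join();
  }
}
{code}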
[jira] [Updated] (HUDI-8002) Add Flink 1.15 and 1.14 bundle validation
[ https://issues.apache.org/jira/browse/HUDI-8002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8002:
    Labels: pull-request-available  (was: )

> Add Flink 1.15 and 1.14 bundle validation
>
> Key: HUDI-8002
> URL: https://issues.apache.org/jira/browse/HUDI-8002
> Project: Apache Hudi
> Issue Type: Test
> Components: tests-ci
> Reporter: Jonathan Vexler
> Assignee: Vova Kolmakov
> Priority: Major
> Labels: pull-request-available
>
> Removed in HUDI-7999, but might need to be added back if we are not discontinuing support.
[jira] [Updated] (HUDI-7964) Partitions not created correctly with SQL when multiple partitions specified out of order
[ https://issues.apache.org/jira/browse/HUDI-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7964:
    Labels: pull-request-available spark-sql  (was: spark-sql)

> Partitions not created correctly with SQL when multiple partitions specified out of order
>
> Key: HUDI-7964
> URL: https://issues.apache.org/jira/browse/HUDI-7964
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sagar Sumit
> Priority: Major
> Labels: pull-request-available, spark-sql
> Fix For: 1.0.0
> Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png, Screenshot 2024-07-11 at 5.43.41 PM.png
>
> When multiple partitions are specified out of order (compared to the order of fields in the CREATE TABLE command), the partitioning on storage is incorrect. Test script (notice that the CREATE TABLE and INSERT INTO commands have city and then state, while the PARTITIONED BY clause has state first and then city):
> {code:java}
> DROP TABLE IF EXISTS hudi_table_mlp;
> CREATE TABLE hudi_table_mlp (
>   ts BIGINT,
>   id STRING,
>   rider STRING,
>   driver STRING,
>   fare DOUBLE,
>   city STRING,
>   state STRING
> ) USING HUDI
> options (
>   primaryKey = 'id',
>   preCombineField = 'ts'
> )
> PARTITIONED BY (state, city)
> location 'file:///tmp/hudi_table_mlp';
>
> INSERT INTO hudi_table_mlp VALUES (1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
> INSERT INTO hudi_table_mlp VALUES (1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
> INSERT INTO hudi_table_mlp VALUES (1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
> INSERT INTO hudi_table_mlp VALUES (1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
> {code}
> This creates partitions as follows (note that the city and state values are swapped):
> !Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
> Now, if I query with the state='texas' filter, there are no results:
> {code:java}
> spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
> Time taken: 0.356 seconds
> {code}
> I have tested this with master, 0.15.0, and 0.14.1, so it is not a recent regression.
[jira] [Updated] (HUDI-8003) Add overwrite payload for hive for record reader
[ https://issues.apache.org/jira/browse/HUDI-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8003:
    Labels: pull-request-available  (was: )

> Add overwrite payload for hive for record reader
>
> Key: HUDI-8003
> URL: https://issues.apache.org/jira/browse/HUDI-8003
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> This is missing from Hive, so if the overwrite merger strategy is chosen, it currently throws an exception.
[jira] [Updated] (HUDI-8001) Insert overwrite failed due to missing 'path' property when using Spark 3.5.1 and Hudi 1.0.0
[ https://issues.apache.org/jira/browse/HUDI-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-8001:
    Labels: pull-request-available  (was: )

> Insert overwrite failed due to missing 'path' property when using Spark 3.5.1 and Hudi 1.0.0
>
> Key: HUDI-8001
> URL: https://issues.apache.org/jira/browse/HUDI-8001
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ma Jian
> Priority: Major
> Labels: pull-request-available
>
> The issue with Spark 3.5.1 arises because the {{InsertIntoHoodieTableCommand}} chain calls the initialization of the {{HoodieFileIndex}} class. For v1 tables, the path is stored in {{CatalogTable#CatalogStorageFormat#storageProperties}}, but not in {{CatalogTable#properties}}. When Spark reloads the table, it removes the path key from {{CatalogTable#CatalogStorageFormat#storageProperties}}. Consequently, {{InsertIntoHoodieTableCommand}} in Hudi cannot retrieve the path from either {{CatalogTable#CatalogStorageFormat#storageProperties}} or {{CatalogTable#properties}} during {{deduceOverwriteConfig}}. This absence of the path key in {{combinedOpts}} leads to an error when initializing {{HoodieFileIndex}}.
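A sketch of the lookup that {{deduceOverwriteConfig}} needs, with hypothetical names (this is not Hudi's actual code): prefer the storage properties, fall back to the table properties, and surface a clear error instead of failing later inside {{HoodieFileIndex}}.

{code:java}
import java.util.Map;
import java.util.Optional;

public class BasePathLookupSketch {
  // Prefer CatalogStorageFormat#storageProperties; fall back to CatalogTable#properties.
  static Optional<String> deduceBasePath(Map<String, String> storageProperties,
                                         Map<String, String> tableProperties) {
    String path = storageProperties.get("path");
    if (path == null) {
      path = tableProperties.get("path");
    }
    return Optional.ofNullable(path);
  }

  public static void main(String[] args) {
    Optional<String> path = deduceBasePath(Map.of(), Map.of("path", "file:///tmp/t1"));
    System.out.println(path.orElseThrow(
        () -> new IllegalStateException("'path' missing from both property maps")));
  }
}
{code}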
[jira] [Updated] (HUDI-7999) Disable ci testing for spark versions less than 3.3
[ https://issues.apache.org/jira/browse/HUDI-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7999:
    Labels: pull-request-available  (was: )

> Disable ci testing for spark versions less than 3.3
>
> Key: HUDI-7999
> URL: https://issues.apache.org/jira/browse/HUDI-7999
> Project: Apache Hudi
> Issue Type: Test
> Components: tests-ci
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> We will be removing support for Spark 2.4, 3.0, 3.1, and 3.2. Before we remove them, we will transition our CI to run all the tests on higher versions of Spark. To prevent CI failures during this transition, we will stop the CI from running tests on the versions to be discontinued.
[jira] [Updated] (HUDI-7998) Failed to insert overwrite hudi table when defining partition column with int type
[ https://issues.apache.org/jira/browse/HUDI-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7998:
    Labels: pull-request-available  (was: )

> Failed to insert overwrite hudi table when defining partition column with int type
>
> Key: HUDI-7998
> URL: https://issues.apache.org/jira/browse/HUDI-7998
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: KnightChess
> Assignee: KnightChess
> Priority: Major
> Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/11623]
[jira] [Updated] (HUDI-7997) fix hivesync to support type promotion correctly
[ https://issues.apache.org/jira/browse/HUDI-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7997:
    Labels: pull-request-available  (was: )

> fix hivesync to support type promotion correctly
>
> Key: HUDI-7997
> URL: https://issues.apache.org/jira/browse/HUDI-7997
> Project: Apache Hudi
> Issue Type: Bug
> Components: meta-sync
> Reporter: nicolas paris
> Assignee: nicolas paris
> Priority: Major
> Labels: pull-request-available
>
> See https://github.com/apache/hudi/issues/11599
[jira] [Updated] (HUDI-7919) Make integration tests run on Spark 3.5
[ https://issues.apache.org/jira/browse/HUDI-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7919:
    Labels: pull-request-available  (was: )

> Make integration tests run on Spark 3.5
>
> Key: HUDI-7919
> URL: https://issues.apache.org/jira/browse/HUDI-7919
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> The job [integration-tests (spark2.4, spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)|https://github.com/apache/hudi/actions/runs/9636476688/job/26574480698#logs] in the GitHub Java CI should run on Spark 3.5 once we can remove Spark 2 support.
[jira] [Updated] (HUDI-7996) Store partition type with partition fields in table configs
[ https://issues.apache.org/jira/browse/HUDI-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7996:
    Labels: pull-request-available  (was: )

> Store partition type with partition fields in table configs
>
> Key: HUDI-7996
> URL: https://issues.apache.org/jira/browse/HUDI-7996
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7995) support decoding number(int, long, double) to fixed field when using JsonKafkaSource
[ https://issues.apache.org/jira/browse/HUDI-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7995:
    Labels: pull-request-available  (was: )

> support decoding number(int, long, double) to fixed field when using JsonKafkaSource
>
> Key: HUDI-7995
> URL: https://issues.apache.org/jira/browse/HUDI-7995
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: Kong Wei
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (HUDI-7934) RocksDBDAO prefixDelete function doesn't delete the last entry
[ https://issues.apache.org/jira/browse/HUDI-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7934:
    Labels: pull-request-available  (was: )

> RocksDBDAO prefixDelete function doesn't delete the last entry
>
> Key: HUDI-7934
> URL: https://issues.apache.org/jira/browse/HUDI-7934
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Vova Kolmakov
> Assignee: Vova Kolmakov
> Priority: Minor
> Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/11075]
>
> getRocksDB().delete(lastEntry.getBytes());
>
> should be changed to
>
> getRocksDB().delete(managedHandlesMap.get(columnFamilyName), lastEntry.getBytes());
>
> And the UT (TestRocksDBDAO) must be fixed appropriately.
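In context, the fixed delete would look roughly like the following sketch (field and method names taken from the ticket; the surrounding prefix iteration is elided). The key point is that the final entry must be removed through the same column family handle the prefix scan used; the single-argument overload targets the default column family, which is why the last entry survived.

{code:java}
import java.nio.charset.StandardCharsets;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PrefixDeleteSketch {
  // Delete the last matched entry from the SAME column family as the scan,
  // not from the default column family.
  static void deleteLastEntry(RocksDB db, ColumnFamilyHandle handle, String lastEntry)
      throws RocksDBException {
    if (lastEntry != null) {
      db.delete(handle, lastEntry.getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}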
[jira] [Updated] (HUDI-7989) Fix secondary index updates with other indexes
[ https://issues.apache.org/jira/browse/HUDI-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7989:
    Labels: pull-request-available  (was: )

> Fix secondary index updates with other indexes
>
> Key: HUDI-7989
> URL: https://issues.apache.org/jira/browse/HUDI-7989
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7988) ListingBasedRollbackStrategy support logcompact
[ https://issues.apache.org/jira/browse/HUDI-7988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7988:
    Labels: pull-request-available  (was: )

> ListingBasedRollbackStrategy support logcompact
>
> Key: HUDI-7988
> URL: https://issues.apache.org/jira/browse/HUDI-7988
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: KnightChess
> Assignee: KnightChess
> Priority: Major
> Labels: pull-request-available
>
> 1. [https://github.com/apache/hudi/issues/11589]
> As the issue describes, log compaction rollback is not supported when markers are not used; it throws
> {{`org.apache.hudi.exception.HoodieRollbackException: Unknown listing type, during rollback of`}}
> 2. Also, if the instant is complete, the log compaction files cannot be deleted either, because the instant is treated as a `compact action` and only base files are deleted.
[jira] [Updated] (HUDI-7986) Make precombine field optional with Dedup feature for Mutable Streams
[ https://issues.apache.org/jira/browse/HUDI-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7986:
    Labels: pull-request-available  (was: )

> Make precombine field optional with Dedup feature for Mutable Streams
>
> Key: HUDI-7986
> URL: https://issues.apache.org/jira/browse/HUDI-7986
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sivaguru Kannan
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter
[ https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7985:
    Labels: pull-request-available  (was: )

> Support more formats in timestamp logical types in Json Avro converter
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in the timestamp logical type.
> * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}}, and {{Z}} is the zone offset equivalent to {{+00:00}} or UTC ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
> * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the separation character
> * There are systems that use a space instead of {{T}} as the separator (other parts are the same). References indicate that ISO 8601 used to allow this by _mutual agreement_ ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet], [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
> * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps like {{2024-05-13T23:53:36.004Z}}, already supported in {{MercifulJsonConverter}}, and additionally {{2011-12-03T10:15:30+01:00}} with a zone offset (which is not supported in {{MercifulJsonConverter}} yet)
> * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse a timestamp with a space as the separator, like {{2011-12-03 10:15:30+01:00}}. But with a simple twist of the formatter, it can be easily supported.
> My take is that we should change the formatter of the timestamp logical types to support the zone offset and the space separator (which is backwards compatible), instead of introducing a new format config (assuming that common use cases just have the space character as the variant).
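The "simple twist" can be a formatter that treats the date/time separator as an optional {{T}} or an optional space. A sketch (not the actual {{MercifulJsonConverter}} change) that parses all three variants from the ticket:

{code:java}
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

public class FlexibleTimestampParser {
  // ISO date, then either 'T' or ' ' as separator, then ISO time with an
  // offset id ('Z' or '+HH:MM').
  private static final DateTimeFormatter FLEXIBLE_ISO = new DateTimeFormatterBuilder()
      .append(DateTimeFormatter.ISO_LOCAL_DATE)
      .optionalStart().appendLiteral('T').optionalEnd()
      .optionalStart().appendLiteral(' ').optionalEnd()
      .append(DateTimeFormatter.ISO_LOCAL_TIME)
      .appendOffsetId()
      .toFormatter();

  public static void main(String[] args) {
    System.out.println(OffsetDateTime.parse("2024-06-03 13:42:34.951+00:00", FLEXIBLE_ISO));
    System.out.println(OffsetDateTime.parse("2024-05-13T23:53:36.004Z", FLEXIBLE_ISO));
    System.out.println(OffsetDateTime.parse("2011-12-03T10:15:30+01:00", FLEXIBLE_ISO));
  }
}
{code}

Because both literals are optional sections, the existing {{T}}-separated inputs keep parsing unchanged, which is what makes the change backwards compatible.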
[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7938:
    Labels: pull-request-available  (was: )

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Geser Dugarov
> Assignee: Geser Dugarov
> Priority: Major
> Labels: pull-request-available
>
> HUDI-7567 (Add schema evolution to the filegroup reader, #10957) broke integration with PySpark.
> When trying to call
> {quote}
> df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
> we got:
> {quote}
> 24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer
[ https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7980:
    Labels: pull-request-available  (was: )

> Optimize the configuration content when performing clustering with row writer
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ma Jian
> Priority: Major
> Labels: pull-request-available
>
> Currently, the row writer defaults to snapshot reads for all tables. However, this method is relatively inefficient for MOR (Merge on Read) tables when there are no logs. Therefore, we should optimize this part of the configuration.
[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class
[ https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7976:
    Labels: pull-request-available  (was: )

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
> In the bugfix for HUDI-7955, the wrong class was used when invoking {{getTimestamp}}:
> # *Wrong*: org.apache.hadoop.hive.common.type.Timestamp
> # *Correct*: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>
> Submitting a bugfix to fix this bugfix...
> The log level for the exception block is also changed to warn so errors will be printed out.
> On top of that, we have simplified the {{getMillis}} shim to remove the method that was added in HUDI-7955, to standardise it with how {{getDays}} is written.
[jira] [Updated] (HUDI-7979) Fix out of the box defaults with spillable memory configs
[ https://issues.apache.org/jira/browse/HUDI-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7979:
    Labels: pull-request-available  (was: )

> Fix out of the box defaults with spillable memory configs
>
> Key: HUDI-7979
> URL: https://issues.apache.org/jira/browse/HUDI-7979
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> Looks like we are very conservative wrt the memory configs used for the spillable-map-based FSV.
>
> For example, out of the box we allocate only 15 MB to file groups when using the spillable-map-based FSV:
>
> public long getMaxMemoryForFileGroupMap() {
>   long totalMemory = getLong(SPILLABLE_MEMORY);
>   return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
> }
>
> The SPILLABLE_MEMORY default is 100 MB, getMaxMemoryForPendingCompaction is 80% of 100 MB, and getMaxMemoryForBootstrapBaseFile is 5% of 100 MB. So, overall, out of the box we allocate only 15 MB for getMaxMemoryForFileGroupMap.
> ref: [FileSystemViewStorageConfig.java#L224|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
> Wondering whether we even need 80% for the pending compaction tracker in our FSV. I am thinking of making it 15%, so that we can give more memory to the actual file groups. We may not have a lot of pending compactions for a given table.
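The arithmetic above, spelled out with the defaults stated in the ticket:

{code:java}
public class SpillableMemoryDefaults {
  public static void main(String[] args) {
    long spillableMemory = 100L * 1024 * 1024;                // SPILLABLE_MEMORY default: 100 MB
    long pendingCompaction = (long) (spillableMemory * 0.80); // 80% reserved for pending compaction
    long bootstrapBaseFile = (long) (spillableMemory * 0.05); // 5% reserved for bootstrap base files
    long fileGroupMap = spillableMemory - pendingCompaction - bootstrapBaseFile;
    System.out.println(fileGroupMap / (1024 * 1024) + " MB for the file group map"); // 15 MB
  }
}
{code}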
[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions
[ https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7978:
    Labels: pull-request-available  (was: )

> Update docs for older versions to state that partitions should be ordered when creating multiple partitions
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: docs
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7977) Improve bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7977:
    Labels: pull-request-available  (was: )

> Improve bucket index partitioner
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
> Issue Type: Improvement
> Components: index
> Reporter: KnightChess
> Assignee: KnightChess
> Priority: Major
> Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm to make the data evenly distributed.
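One way such an algorithm could spread load, as an illustration only (this is not the actual {{BucketIndexUtil}} change): offset each partition's bucket-to-task mapping by a hash of the partition path, so bucket 0 of every partition does not pile onto the same task.

{code:java}
public class BucketSpreadSketch {
  // Illustrative mapping of (partitionPath, bucketId) -> writer task index.
  static int taskIndex(String partitionPath, int bucketId, int parallelism) {
    int offset = Math.floorMod(partitionPath.hashCode(), parallelism);
    return (offset + bucketId) % parallelism;
  }

  public static void main(String[] args) {
    // Bucket 0 of different partitions now lands on different tasks.
    System.out.println(taskIndex("dt=2024-07-01", 0, 8));
    System.out.println(taskIndex("dt=2024-07-02", 0, 8));
  }
}
{code}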
[jira] [Updated] (HUDI-7975) Transfer extra metadata to new commits when new data is not ingested, to trigger table services on the dataset
[ https://issues.apache.org/jira/browse/HUDI-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7975:
    Labels: pull-request-available  (was: )

> Transfer extra metadata to new commits when new data is not ingested, to trigger table services on the dataset
>
> Key: HUDI-7975
> URL: https://issues.apache.org/jira/browse/HUDI-7975
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Surya Prasanna Yalla
> Assignee: Surya Prasanna Yalla
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7974) Create empty clean commit at a cadence and make it configurable
[ https://issues.apache.org/jira/browse/HUDI-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7974:
    Labels: pull-request-available  (was: )

> Create empty clean commit at a cadence and make it configurable
>
> Key: HUDI-7974
> URL: https://issues.apache.org/jira/browse/HUDI-7974
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Surya Prasanna Yalla
> Assignee: Surya Prasanna Yalla
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is also stored in table config
[ https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7970:
    Labels: pull-request-available  (was: )

> Add support to read partition fields when partition type is also stored in table config
>
> Key: HUDI-7970
> URL: https://issues.apache.org/jira/browse/HUDI-7970
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` to also store the partition type. This PR aims to make sure that the getter and other functions accessing this field remain consistent in behaviour with the new value type.
[jira] [Updated] (HUDI-7969) Fix data loss caused by concurrent write and clean
[ https://issues.apache.org/jira/browse/HUDI-7969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7969:
    Labels: pull-request-available  (was: )

> Fix data loss caused by concurrent write and clean
>
> Key: HUDI-7969
> URL: https://issues.apache.org/jira/browse/HUDI-7969
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Xinyu Zou
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7692) Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType
[ https://issues.apache.org/jira/browse/HUDI-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7692:
    Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType
>
> Key: HUDI-7692
> URL: https://issues.apache.org/jira/browse/HUDI-7692
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
> https://github.com/apache/hudi/pull/10352#discussion_r1584137942
[jira] [Updated] (HUDI-7025) Merge Index and Functional Index Config
[ https://issues.apache.org/jira/browse/HUDI-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7025:
    Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Merge Index and Functional Index Config
>
> Key: HUDI-7025
> URL: https://issues.apache.org/jira/browse/HUDI-7025
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Minor
> Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
> There is an {{INDEX}} sub-group name in `ConfigGroups`. Functional index configs can be consolidated within that.
>
> https://github.com/apache/hudi/pull/9872#discussion_r1377115549
[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7967:
    Labels: RobustWrites pull-request-available  (was: RobustWrites)

> Robust handling of spark task failures and retries
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
> Issue Type: Epic
> Components: reader-core, writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: RobustWrites, pull-request-available
[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7968:
    Labels: RobustWrites pull-request-available  (was: RobustWrites)

> RFC for robust handling of spark task failures and retries
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: RobustWrites, pull-request-available
[jira] [Updated] (HUDI-7962) Add show create table command
[ https://issues.apache.org/jira/browse/HUDI-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7962:
    Labels: pull-request-available  (was: )

> Add show create table command
>
> Key: HUDI-7962
> URL: https://issues.apache.org/jira/browse/HUDI-7962
> Project: Apache Hudi
> Issue Type: New Feature
> Components: cli
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Updated] (HUDI-7966) NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference
[ https://issues.apache.org/jira/browse/HUDI-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7966:
    Labels: pull-request-available  (was: )

> NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference
>
> Key: HUDI-7966
> URL: https://issues.apache.org/jira/browse/HUDI-7966
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Running the [long-running|https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/spark-long-running-non-partitioned.yaml] deltastreamer with the following properties:
> [https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/test-nonpartitioned.properties]
> The job throws an NPE during the validation phase:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 69.0 failed 4 times, most recent failure: Lost task 0.3 in stage 69.0 (TID 345) (10.0.103.207 executor 1): java.lang.NullPointerException
>     at org.apache.avro.JsonProperties$2$1$1.<init>(JsonProperties.java:175)
>     at org.apache.avro.JsonProperties$2$1.iterator(JsonProperties.java:174)
>     at org.apache.avro.JsonProperties.getObjectProps(JsonProperties.java:305)
>     at org.apache.hudi.avro.AvroSchemaUtils.createNewSchemaFromFieldsWithReference(AvroSchemaUtils.java:306)
>     at org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaBase(AvroSchemaUtils.java:293)
>     at org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested(AvroSchemaUtils.java:245)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.generateRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:146)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.prepareRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:150)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.<init>(HoodieFileGroupReaderSchemaHandler.java:84)
>     at org.apache.hudi.common.table.read.HoodieFileGroupReader.<init>(HoodieFileGroupReader.java:113)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:170)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:136)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> It seems like the code assumes that all schemas must have properties, which may not necessarily be true.
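A defensive sketch of the null-safety the ticket implies (an assumption about the shape of the fix, not the committed change): treat schemas without materializable properties as having none before copying them onto the new schema.

{code:java}
import java.util.Collections;
import java.util.Map;

import org.apache.avro.Schema;

public class SchemaPropsSketch {
  // Returns the schema-level properties, tolerating schemas without any.
  static Map<String, Object> safeObjectProps(Schema schema) {
    Map<String, Object> props = schema.getObjectProps();
    return props != null ? props : Collections.emptyMap();
  }

  public static void main(String[] args) {
    Schema s = Schema.create(Schema.Type.STRING);
    System.out.println(safeObjectProps(s)); // {}
  }
}
{code}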
[jira] [Updated] (HUDI-7965) Clean up SchemaTestUtil code
[ https://issues.apache.org/jira/browse/HUDI-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7965:
    Labels: pull-request-available  (was: )

> Clean up SchemaTestUtil code
>
> Key: HUDI-7965
> URL: https://issues.apache.org/jira/browse/HUDI-7965
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: bradley
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7963) Avoid generating RLI records when disabled w/ MDT
[ https://issues.apache.org/jira/browse/HUDI-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7963:
    Labels: pull-request-available  (was: )

> Avoid generating RLI records when disabled w/ MDT
>
> Key: HUDI-7963
> URL: https://issues.apache.org/jira/browse/HUDI-7963
> Project: Apache Hudi
> Issue Type: Improvement
> Components: metadata
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
[ https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7961:
    Labels: pull-request-available  (was: )

> Optimize UpsertPartitioner for prepped write operations
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> We have average record size calculation, etc., in UpsertPartitioner, which does not make sense for prepped write operations. Also, with MDT, we can optimize these.
[jira] [Updated] (HUDI-7958) Create partition stats index for all columns when no columns specified
[ https://issues.apache.org/jira/browse/HUDI-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7958:
    Labels: pull-request-available  (was: )

> Create partition stats index for all columns when no columns specified
>
> Key: HUDI-7958
> URL: https://issues.apache.org/jira/browse/HUDI-7958
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Just like the column stats index, we can create the partition stats index for all columns if no columns are configured by the user.
[jira] [Updated] (HUDI-7957) data skew when writing with bulk_insert + bucket_index enabled
[ https://issues.apache.org/jira/browse/HUDI-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7957:
    Labels: pull-request-available  (was: )

> data skew when writing with bulk_insert + bucket_index enabled
>
> Key: HUDI-7957
> URL: https://issues.apache.org/jira/browse/HUDI-7957
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: KnightChess
> Assignee: KnightChess
> Priority: Major
> Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11565] describes, when row-writer bulk insert is used on a bucket-index table, the data skews because of the partitioner algorithm.
[jira] [Updated] (HUDI-7955) Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and Hive2 discrepancies
[ https://issues.apache.org/jira/browse/HUDI-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7955:
    Labels: pull-request-available  (was: )

> Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and Hive2 discrepancies
>
> Key: HUDI-7955
> URL: https://issues.apache.org/jira/browse/HUDI-7955
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-07-05-18-11-33-420.png, image-2024-07-05-18-13-28-135.png
>
> The invocation of *getPrimitiveJavaObject* returns a different implementation of timestamp in Hive3 and Hive2:
> - Hive2: *java.sql.Timestamp*
> - Hive3: *org.apache.hadoop.hive.common.type.Timestamp*
> Hudi common is compiled with Hive2, but Trino is using Hive3, causing the discrepancy between compile time and runtime. The error is triggered when execution falls into the code path where:
> 1. a MOR table is used
> 2. the user is querying the _rt table
> 3. the user's table has a *TIMESTAMP* type and the query requires it
> 4. a merge is required because the record is present in both the Parquet and log file
> The error below will be thrown:
> {code:java}
> Query 20240704_075218_05052_yfmfc failed: 'java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
> java.lang.NoSuchMethodError: 'java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:304)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:212)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:121)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:108)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:185)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:172)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:114)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:49)
>     at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:88)
>     at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>     at io.trino.plugin.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:215)
>     at io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:88)
>     at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120)
> {code}
> h1. Hive3
> !image-2024-07-05-18-11-33-420.png|width=509,height=572!
> h1. Hive2
> !image-2024-07-05-18-13-28-135.png|width=507,height=501!
> h1. How to reproduce
> {code:java}
> CREATE TABLE dev_hudi.hudi_7955__hive3_timestamp_issue (
>   id INT,
>   name STRING,
>   timestamp_col TIMESTAMP,
>   grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
>   primaryKey = 'id',
>   type = 'mor',
>   precombineField = 'id',
>   hoodie.index.type = 'BUCKET',
>   hoodie.index.bucket.engine = 'CONSISTENT_HASHING',
>   hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://path/to/hudi_tables/hudi_7955__hive3_timestamp_issue';
>
> -- 5 separate commits to trigger compaction
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (1, 'alex1', now(), 'SG');
>
> -- No error here as no MERGE is required between Parquet + Log
> SELECT _hoodie_file_name, id, timestamp_col FROM dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE '%parquet%';
>
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (2, 'alex2', now(), 'SG'
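A sketch of a version-agnostic shim (an assumption about the approach, not the committed HUDI-7955/HUDI-7976 fix): resolving {{getPrimitiveJavaObject}} reflectively avoids linking against a fixed return type, so the Hive2 {{java.sql.Timestamp}} vs. Hive3 {{org.apache.hadoop.hive.common.type.Timestamp}} difference no longer triggers a {{NoSuchMethodError}} at runtime.

{code:java}
import java.lang.reflect.Method;

public class TimestampInspectorShim {
  // Resolve the method against the runtime class so the return type is not
  // baked in at compile time (Hive2 returns java.sql.Timestamp, Hive3 returns
  // org.apache.hadoop.hive.common.type.Timestamp).
  static Object primitiveJavaObject(Object inspector, Object writable) {
    try {
      Method m = inspector.getClass().getMethod("getPrimitiveJavaObject", Object.class);
      return m.invoke(inspector, writable);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(
          "Unsupported timestamp inspector: " + inspector.getClass(), e);
    }
  }
}
{code}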
[jira] [Updated] (HUDI-7954) Fix data skipping with secondary index when there are no log files
[ https://issues.apache.org/jira/browse/HUDI-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7954:
    Labels: pull-request-available  (was: )

> Fix data skipping with secondary index when there are no log files
>
> Key: HUDI-7954
> URL: https://issues.apache.org/jira/browse/HUDI-7954
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
> When there are no log files in the index, the lookup returns no secondary keys or candidate files because of a bug: `logRecordsMap` is empty in this code and base file records are ignored - [https://github.com/apache/hudi/blob/70f44efe298771fcef9d029820a9b431e1ff165c/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java#L970]
> The current tests for pruning assert that the filtered file count < total data file count. That is weak in the sense that it does not assert that the filtered file count > 0, so the assertion passed even when the filtered file count was 0. Ultimately, all files were getting scanned. We should fix this behavior.
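The stronger assertion would look roughly like this, with hypothetical variable names for the quantities the existing pruning test already computes:

{code:java}
import static org.junit.jupiter.api.Assertions.assertTrue;

public class PruningAssertionSketch {
  // filteredFileCount / totalFileCount stand in for values the pruning test computes.
  static void assertPruned(int filteredFileCount, int totalFileCount) {
    assertTrue(filteredFileCount > 0, "index lookup must not prune every candidate file");
    assertTrue(filteredFileCount < totalFileCount, "index lookup should skip at least one file");
  }
}
{code}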
[jira] [Updated] (HUDI-7953) Improved the variable naming and formatting of HoodieActiveTimeline and HoodieIndex
[ https://issues.apache.org/jira/browse/HUDI-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7953:
    Labels: pull-request-available  (was: )

> Improved the variable naming and formatting of HoodieActiveTimeline and HoodieIndex
>
> Key: HUDI-7953
> URL: https://issues.apache.org/jira/browse/HUDI-7953
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: bradley
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (HUDI-6510) Java 17 compile time support
[ https://issues.apache.org/jira/browse/HUDI-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6510:
    Labels: pull-request-available  (was: )

> Java 17 compile time support
>
> Key: HUDI-6510
> URL: https://issues.apache.org/jira/browse/HUDI-6510
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Udit Mehrotra
> Assignee: Shawn Chang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Certify Hudi with Java 17 compile time support.
[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s
[ https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7929: - Labels: pull-request-available (was: ) > Add Flink Hudi Example for K8s > -- > > Key: HUDI-7929 > URL: https://issues.apache.org/jira/browse/HUDI-7929 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Zhenqiu Huang >Assignee: Zhenqiu Huang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7949) insert into hudi table with columns specified(reordered and not in table schema order) throws exception
[ https://issues.apache.org/jira/browse/HUDI-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7949: - Labels: pull-request-available (was: ) > insert into hudi table with columns specified(reordered and not in table > schema order) throws exception > --- > > Key: HUDI-7949 > URL: https://issues.apache.org/jira/browse/HUDI-7949 > Project: Apache Hudi > Issue Type: New Feature > Components: spark-sql >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > Labels: pull-request-available > > https://github.com/apache/hudi/issues/11552 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering
[ https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7937: - Labels: pull-request-available (was: ) > Fix handling of decimals in StreamSync and Clustering > - > > Key: HUDI-7937 > URL: https://issues.apache.org/jira/browse/HUDI-7937 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > When decimals use a small precision, we need to write them in the legacy format to ensure all Hudi components can read them back. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle
[ https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7951: - Labels: pull-request-available (was: ) > Classes using avro causing conflict in hudi-aws-bundle > -- > > Key: HUDI-7951 > URL: https://issues.apache.org/jira/browse/HUDI-7951 > Project: Apache Hudi > Issue Type: Bug >Reporter: Shawn Chang >Priority: Major > Labels: pull-request-available > > Hudi 0.15 added some Hudi classes with Avro usages (ParquetTableSchemaResolver in this case) and also had hudi-aws-bundle depend on hudi-hadoop-common. hudi-aws-bundle doesn't relocate Avro classes, to stay compatible with hudi-spark. > > The issue happens when using hudi-flink-bundle together with hudi-aws-bundle: hudi-flink-bundle has relocated Avro classes, which causes a class conflict: > {code:java} > java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType > org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema, > org.apache.hadoop.conf.Configuration)' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM
[ https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7950: - Labels: pull-request-available (was: ) > Shade roaring bitmap dependency in root POM > --- > > Key: HUDI-7950 > URL: https://issues.apache.org/jira/browse/HUDI-7950 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0, 0.15.1 > > > We should unify the shading rule of roaring bitmap dependency in the root POM > for consistency among bundles. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7941) add show_file_status procedure
[ https://issues.apache.org/jira/browse/HUDI-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7941: - Labels: pull-request-available (was: ) > add show_file_status procedure > -- > > Key: HUDI-7941 > URL: https://issues.apache.org/jira/browse/HUDI-7941 > Project: Apache Hudi > Issue Type: New Feature >Reporter: 陈磊 >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0 > > > When incrementally consuming a Hudi table, or performing clustering or compaction operations on it, it is often found that a certain file does not exist, and it is very troublesome to find out which operation deleted the file. For this purpose, we provide a `show_file_status` procedure to show whether a specified file has been deleted and which action deleted it. > usage: > call show_file_status(table => '$tableName', partition => '$partition', file => '$fileName') > call show_file_status(table => '$tableName', file => '$fileName') > output: > 1) the file was deleted by the restore action
> +-------+-------+-----------------+--------+---------+
> |status |action |instant          |timeline|full_path|
> +-------+-------+-----------------+--------+---------+
> |deleted|restore|20240629225539880|active  |         |
> +-------+-------+-----------------+--------+---------+
> 2) the file has been deleted in other ways, such as hdfs dfs -rm
> +-------+------+-------+--------+---------+
> |status |action|instant|timeline|full_path|
> +-------+------+-------+--------+---------+
> |unknown|      |       |        |         |
> +-------+------+-------+--------+---------+
> 3) the file exists
> +------+------+-------+--------+---------+
> |status|action|instant|timeline|full_path|
> +------+------+-------+--------+---------+
> |exist |      |       |active  |/Users/xx/xx/others/data/hudi-warehouse/source1/hudi_mor_append/sex=0/85ad0f44-22bf-4733-99bf-06382d6eacd5-0_0-130-89_20240629230123162.parquet|
> +------+------+-------+--------+---------+
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7948) RFC-80: Support column families for wide tables
[ https://issues.apache.org/jira/browse/HUDI-7948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7948: - Labels: pull-request-available (was: ) > RFC-80: Support column families for wide tables > --- > > Key: HUDI-7948 > URL: https://issues.apache.org/jira/browse/HUDI-7948 > Project: Apache Hudi > Issue Type: Task >Reporter: Vova Kolmakov >Assignee: Vova Kolmakov >Priority: Major > Labels: pull-request-available > > Write, discuss, approve RFC document in github -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7943) Resolve version conflict of fasterxml on spark3.2
[ https://issues.apache.org/jira/browse/HUDI-7943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7943: - Labels: pull-request-available (was: ) > Resolve version conflict of fasterxml on spark3.2 > -- > > Key: HUDI-7943 > URL: https://issues.apache.org/jira/browse/HUDI-7943 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies > Environment: hudi0.14.1, Spark3.2 >Reporter: Jihwan Lee >Priority: Major > Labels: pull-request-available > > When running a streaming read on Spark 3.2, an exception is raised that requires the correct version of jackson-databind. > Spark versions other than 3.2 seem to use the versions tied to Spark's own dependencies. > > version reference: https://github.com/apache/spark/blob/v3.2.3/pom.xml#L170 > > example code: > > {code:java} > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.common.table.HoodieTableConfig._ > import org.apache.hudi.config.HoodieWriteConfig._ > import org.apache.hudi.keygen.constant.KeyGeneratorOptions._ > import org.apache.hudi.common.model.HoodieRecord > import spark.implicits._ > val basePath = "hdfs:///tmp/trips_table" > spark.readStream > .format("hudi") > .option("hoodie.datasource.query.type", "incremental") > .option("hoodie.datasource.query.incremental.format", "cdc") > .load(basePath) > .writeStream > .format("console") > .option("checkpointLocation", "/tmp/trips_table_checkpoint") > .outputMode("append") > .start().awaitTermination() > {code} > > > error log: > > {code:java} > Caused by: java.lang.ExceptionInInitializerError: > com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.0 > requires Jackson Databind version >= 2.10.0 and < 2.11.0 > at > org.apache.spark.sql.hudi.streaming.HoodieSourceOffset.(HoodieSourceOffset.scala:30) > at > org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getLatestOffset(HoodieStreamSource.scala:127) > at > org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getOffset(HoodieStreamSource.scala:138) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:403) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:402) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:384) > at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:627) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:380) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:210) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373) > at > org.apache.spark.sql.execution.streaming
[jira] [Updated] (HUDI-7883) Ensure 1.x commit instants are readable w/ 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7883: - Labels: pull-request-available (was: ) > Ensure 1.x commit instants are readable w/ 0.16.0 > -- > > Key: HUDI-7883 > URL: https://issues.apache.org/jira/browse/HUDI-7883 > Project: Apache Hudi > Issue Type: Improvement >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > > Ensure 1.x commit instants are readable w/ the 0.16.0 reader. > > Maybe we need to migrate the HoodieInstant parsing logic to 0.16.0 in a backwards-compatible manner, or it is already ported and we just need to write tests and validate. > [https://github.com/apache/hudi/pull/9617] - contains some portion (HoodieInstant changes and some method renames) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark
[ https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7945: - Labels: pull-request-available (was: ) > Fix file pruning using PARTITION_STATS index in Spark > - > > Key: HUDI-7945 > URL: https://issues.apache.org/jira/browse/HUDI-7945 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0 > > > The issue can be reproduced by [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859]. > When there is more than one base file in a table partition, the corresponding PARTITION_STATS index record in the metadata table contains null as the file_path field in HoodieColumnRangeMetadata (the generic signature below is reconstructed from the surrounding context, as the angle brackets were lost in the original):
{code:java}
private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
                                                                                  HoodieColumnRangeMetadata<T> another) {
  ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
      "Column names should be the same for merging column ranges");
  final T minValue = getMinValueForColumnRanges(one, another);
  final T maxValue = getMaxValueForColumnRanges(one, another);
  return HoodieColumnRangeMetadata.create(
      null, one.getColumnName(), minValue, maxValue,
      one.getNullCount() + another.getNullCount(),
      one.getValueCount() + another.getValueCount(),
      one.getTotalSize() + another.getTotalSize(),
      one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
}
{code}
The null causes an NPE when loading the column stats per partition from the PARTITION_STATS index. Also, the current implementation of PartitionStatsIndexSupport assumes that the file_path field contains the exact file name, and it does not work when the file path is null (even storing a list of file names does not work). We have to reimplement PartitionStatsIndexSupport so that it gives the pruned partitions for further processing. 
> {code:java} > Caused by: java.lang.NullPointerException: element cannot be mapped to a null > key > at java.util.Objects.requireNonNull(Objects.java:228) > at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907) > at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at > java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747) > at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721) > at java.util.stream.AbstractTask.compute(AbstractTask.java:327) > at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) > at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) > at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) > at > java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) > at > org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115) > at > org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253) > at > org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149) > at > org.apache.hudi.Ho
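The NPE in the stack trace above comes from `Collectors.groupingBy`, which rejects a null classifier result. A minimal standalone reproduction of just that Java behavior (unrelated to Hudi internals; the data is illustrative):
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingByNullKeyDemo {
  public static void main(String[] args) {
    // Mirrors the bug above: the merged column range carries a null file path.
    List<String[]> ranges = Arrays.asList(
        new String[] {null, "col_a"},          // merged range: file path lost
        new String[] {"f1.parquet", "col_a"});
    // groupingBy requires a non-null key; the first element throws
    // "element cannot be mapped to a null key", matching the stack trace.
    Map<String, List<String[]>> byFile =
        ranges.stream().collect(Collectors.groupingBy(r -> r[0]));
    System.out.println(byFile);
  }
}
{code}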
[jira] [Updated] (HUDI-7940) Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table
[ https://issues.apache.org/jira/browse/HUDI-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7940: - Labels: pull-request-available (was: ) > Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table > --- > > Key: HUDI-7940 > URL: https://issues.apache.org/jira/browse/HUDI-7940 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Rajesh Mahindra >Assignee: Rajesh Mahindra >Priority: Minor > Labels: pull-request-available > > Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7882: - Labels: pull-request-available (was: ) > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7905: - Labels: pull-request-available (was: ) > Use cluster action for clustering pending instants > -- > > Key: HUDI-7905 > URL: https://issues.apache.org/jira/browse/HUDI-7905 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Currently, we use replacecommit for clustering, insert overwrite and delete partition. Clustering should be a separate action for the requested and inflight instants. This simplifies a few things: for example, we would no longer need to scan replacecommit.requested to determine whether we are looking at a clustering plan or not. It would also simplify the usage of pending-clustering-related APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7859) Rename instant files to be consistent with 0.x naming format
[ https://issues.apache.org/jira/browse/HUDI-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7859: - Labels: pull-request-available (was: ) > Rename instant files to be consistent with 0.x naming format > > > Key: HUDI-7859 > URL: https://issues.apache.org/jira/browse/HUDI-7859 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: YangXuan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Needed for downgrade -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7915) Spark 4 support
[ https://issues.apache.org/jira/browse/HUDI-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7915: - Labels: pull-request-available (was: ) > Spark 4 support > --- > > Key: HUDI-7915 > URL: https://issues.apache.org/jira/browse/HUDI-7915 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Shawn Chang >Priority: Major > Labels: pull-request-available > > Spark 4.0.0-preview1 is out. We should start integrating Hudi with Spark 4 > and surface any issues early on. > https://spark.apache.org/news/spark-4.0.0-preview1.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4822) Extract the baseFile and logFIles from HoodieDeltaWriteStat in the right way
[ https://issues.apache.org/jira/browse/HUDI-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4822: - Labels: pull-request-available (was: ) > Extract the baseFile and logFIles from HoodieDeltaWriteStat in the right way > > > Key: HUDI-4822 > URL: https://issues.apache.org/jira/browse/HUDI-4822 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Yann Byron >Assignee: Vova Kolmakov >Priority: Major > Labels: pull-request-available > > Currently, we can't get the `baseFile` and `logFiles` members from `HoodieDeltaWriteStat` directly, because the related information is lost after deserialization from the commit files. So we need to improve this. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7903) Partition Stats Index not getting created with SQL
[ https://issues.apache.org/jira/browse/HUDI-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7903: - Labels: pull-request-available (was: ) > Partition Stats Index not getting created with SQL > -- > > Key: HUDI-7903 > URL: https://issues.apache.org/jira/browse/HUDI-7903 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0 > > > {code:java} > spark.sql( > s""" > | create table $tableName using hudi > | partitioned by (dt) > | tblproperties( > |primaryKey = 'id', > |preCombineField = 'ts', > |'hoodie.metadata.index.partition.stats.enable' = 'true' > | ) > | location '$tablePath' > | AS > | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, > cast('2021-05-06' as date) as dt >""".stripMargin > ) {code} > Even when partition stats is enabled, the index is not created via SQL, though it works via the datasource. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7926) dataskipping failure mode should be strict in test
[ https://issues.apache.org/jira/browse/HUDI-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7926: - Labels: pull-request-available (was: ) > dataskipping failure mode should be strict in test > -- > > Key: HUDI-7926 > URL: https://issues.apache.org/jira/browse/HUDI-7926 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: KnightChess >Assignee: KnightChess >Priority: Critical > Labels: pull-request-available > > The data skipping failure mode should be strict in tests. If the fallback mode is used by default, the query UTs are meaningless. > Bugs may have been introduced in other code paths but cannot be caught. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7709: - Labels: pull-request-available (was: ) > Class Cast Exception while reading the data using TimestampBasedKeyGenerator > > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7924) Capture Latency and Failure Metrics For Hive Table recreation
[ https://issues.apache.org/jira/browse/HUDI-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7924: - Labels: pull-request-available (was: ) > Capture Latency and Failure Metrics For Hive Table recreation > - > > Key: HUDI-7924 > URL: https://issues.apache.org/jira/browse/HUDI-7924 > Project: Apache Hudi > Issue Type: Task >Reporter: Vamsi Karnika >Priority: Major > Labels: pull-request-available > > As part of recreating the Glue and Hive table whenever schema or partition sync fails, we want to capture and push metrics related to latency (time taken to recreate and sync the table) and a failure metric (when recreating the table fails). * Push a latency metric to capture the time taken to recreate and sync the table > * Push a failure metric if recreate and sync fails. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7922) Add Hudi CLI bundle for Scala 2.13
[ https://issues.apache.org/jira/browse/HUDI-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7922: - Labels: pull-request-available (was: ) > Add Hudi CLI bundle for Scala 2.13 > -- > > Key: HUDI-7922 > URL: https://issues.apache.org/jira/browse/HUDI-7922 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Build of Hudi CLI bundle should succeed on Scala 2.13 and work on Spark 3.5 > and Scala 2.13. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled
[ https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7921: - Labels: pull-request-available (was: ) > Chase down memory leaks in Writeclient with MDT enabled > --- > > Key: HUDI-7921 > URL: https://issues.apache.org/jira/browse/HUDI-7921 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We see OOMs when deltastreamer has been running continuously for days together. We suspect some memory leaks when the metadata table is enabled. Let's try to chase all of them down and fix them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7911) Enable cdc log for MOR table
[ https://issues.apache.org/jira/browse/HUDI-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7911: - Labels: pull-request-available (was: ) > Enable cdc log for MOR table > > > Key: HUDI-7911 > URL: https://issues.apache.org/jira/browse/HUDI-7911 > Project: Apache Hudi > Issue Type: New Feature > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7920) Make Spark 3.5 the default build profile for Spark integration
[ https://issues.apache.org/jira/browse/HUDI-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7920: - Labels: pull-request-available (was: ) > Make Spark 3.5 the default build profile for Spark integration > -- > > Key: HUDI-7920 > URL: https://issues.apache.org/jira/browse/HUDI-7920 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Currently, Spark 3.2 is the default build profile. Given Spark 3.2 is no longer actively maintained (the latest Spark 3.2.x release is from April 2023), we should upgrade the default build profile on Spark to Spark 3.5 to maintain support for the latest Spark release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7914) Incorrect schema produced in DELETE_PARTITION replacecommit
[ https://issues.apache.org/jira/browse/HUDI-7914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7914: - Labels: pull-request-available (was: ) > Incorrect schema produced in DELETE_PARTITION replacecommit > --- > > Key: HUDI-7914 > URL: https://issues.apache.org/jira/browse/HUDI-7914 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vitali Makarevich >Priority: Major > Labels: pull-request-available > > In the current scenario, delete_partitions produces a {{replacecommit}} with internal fields like {{{}_hoodie_file_name{}}}, while e.g. a normal {{commit}} produces a schema without such fields. > This leads to unexpected behavior when the {{replacecommit}} is the last one on the timeline, > e.g. [#10258|https://github.com/apache/hudi/issues/10258] > [#10533|https://github.com/apache/hudi/issues/10533] > Metadata sync, or any other potential write, will pick up the incorrect schema: in the best case it will fail because fields are duplicated, and in the worst case it can lead to data loss. > The problem was introduced here: [https://github.com/apache/hudi/pull/5610/files] > For other operations like {{delete}}, the same approach is used as I use now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7909) Add Comment to the FieldSchema returned by Aws Glue Client
[ https://issues.apache.org/jira/browse/HUDI-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7909: - Labels: pull-request-available (was: ) > Add Comment to the FieldSchema returned by Aws Glue Client > --- > > Key: HUDI-7909 > URL: https://issues.apache.org/jira/browse/HUDI-7909 > Project: Apache Hudi > Issue Type: Task >Reporter: Vamsi Karnika >Priority: Major > Labels: pull-request-available > > The implementation of getMetastoreFieldSchema in AwsGlueCatalogSyncClient doesn't include the comment as part of the FieldSchema. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7906) improve the parallelism deduce in rdd write
[ https://issues.apache.org/jira/browse/HUDI-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7906: - Labels: pull-request-available (was: ) > improve the parallelism deduce in rdd write > --- > > Key: HUDI-7906 > URL: https://issues.apache.org/jira/browse/HUDI-7906 > Project: Apache Hudi > Issue Type: Improvement >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > Labels: pull-request-available > > As [https://github.com/apache/hudi/issues/11274] and [https://github.com/apache/hudi/pull/11463] describe, there are two problem cases (a deduction sketch follows below). > # if the rdd is the input rdd without a shuffle, the partition number can be too big or too small > # users cannot control it easily > ## in some cases users can set `spark.default.parallelism` to change it > ## in some cases users cannot change it because it is hard-coded > ## and in Spark, the better way is to let `spark.default.parallelism` or `spark.sql.shuffle.partitions` control it; anything beyond that is advanced tuning in hudi. -- This message was sent by Atlassian Jira (v8.20.10#820010)
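A hedged sketch of the deduction order argued for above: prefer an explicit Hudi-level setting, then Spark's own knobs, then the incoming RDD layout. The method and parameter names are illustrative, not Hudi's actual API:
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;

public class ParallelismDeduceSketch {
  static int deduceShuffleParallelism(SparkConf conf, JavaRDD<?> input, int hudiConfigured) {
    if (hudiConfigured > 0) {
      return hudiConfigured;                  // explicit Hudi-level override
    }
    int sparkDefault = conf.getInt("spark.default.parallelism", 0);
    if (sparkDefault > 0) {
      return sparkDefault;                    // standard Spark-wide control
    }
    return input.getNumPartitions();          // last resort: input RDD layout
  }
}
{code}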
[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7877: - Labels: pull-request-available (was: ) > Add record position to record index metadata payload > > > Key: HUDI-7877 > URL: https://issues.apache.org/jira/browse/HUDI-7877 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > RLI should save the record position so that it can be used in the index lookup. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7892) Building workload support set parallelism
[ https://issues.apache.org/jira/browse/HUDI-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7892: - Labels: pull-request-available (was: ) > Building workload support set parallelism > - > > Key: HUDI-7892 > URL: https://issues.apache.org/jira/browse/HUDI-7892 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > > Building workload support set parallelism -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7891) Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type
[ https://issues.apache.org/jira/browse/HUDI-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7891: - Labels: pull-request-available (was: ) > Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type > -- > > Key: HUDI-7891 > URL: https://issues.apache.org/jira/browse/HUDI-7891 > Project: Apache Hudi > Issue Type: Improvement >Reporter: bradley >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7881) Handle table base path changes in meta syncs.
[ https://issues.apache.org/jira/browse/HUDI-7881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7881: - Labels: pull-request-available (was: ) > Handle table base path changes in meta syncs. > - > > Key: HUDI-7881 > URL: https://issues.apache.org/jira/browse/HUDI-7881 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer, meta-sync >Reporter: Vinish Reddy >Assignee: Vinish Reddy >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7880) Support extraMetadata in Spark SQL Insert Into
[ https://issues.apache.org/jira/browse/HUDI-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7880: - Labels: pull-request-available (was: ) > Support extraMetadata in Spark SQL Insert Into > -- > > Key: HUDI-7880 > URL: https://issues.apache.org/jira/browse/HUDI-7880 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: 董可伦 >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Users want to implement checkpoints similar to those in Hudi DeltaStreamer. DeltaStreamer implements this by saving values to extraMetadata in a commit file, with the key deltastreamer.checkpoint.key. We can achieve this with the Spark client by configuring the parameter `hoodie.datasource.write.commitmeta.key.prefix` (a sketch of the datasource-side path follows below), but in Spark SQL, it is restricted that the prefix of the configuration parameter must be `hoodie.` -- This message was sent by Atlassian Jira (v8.20.10#820010)
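A hedged sketch of the datasource-side workaround described above, assuming the commit-metadata key-prefix config spelled out in the description (the original text is garbled there), that `df` is an existing Dataset<Row>, and a purely illustrative checkpoint key/value:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class CommitMetaCheckpointSketch {
  static void writeWithCheckpoint(Dataset<Row> df, String basePath) {
    df.write()
      .format("hudi")
      .option("hoodie.table.name", "trips_table")
      // Options whose keys start with this prefix are carried into the
      // commit metadata (extraMetadata), mimicking DeltaStreamer's checkpoint.
      .option("hoodie.datasource.write.commitmeta.key.prefix", "deltastreamer.checkpoint")
      .option("deltastreamer.checkpoint.key", "batch-42")  // illustrative value
      .mode("append")
      .save(basePath);
  }
}
{code}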
[jira] [Updated] (HUDI-7879) Optimize the redundant creation of HoodieTable in DataSourceInternalWriterHelper and the unnecessary parameters in createTable within BaseHoodieWriteClient.
[ https://issues.apache.org/jira/browse/HUDI-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7879: - Labels: pull-request-available (was: ) > Optimize the redundant creation of HoodieTable in > DataSourceInternalWriterHelper and the unnecessary parameters in createTable > within BaseHoodieWriteClient. > > > Key: HUDI-7879 > URL: https://issues.apache.org/jira/browse/HUDI-7879 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > > In the initialization method of DataSourceInternalWriterHelper, it currently > creates two identical HoodieTable instances. We should remove one of them. > Also, when comparing the differences between the two HoodieTable instances, I > noticed that the createTable method in BaseHoodieWriteClient includes a > HadoopConfiguration parameter that isn't used by any implemented methods. I'm > not sure why it was designed this way, but I think we can remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7876) Use TypedProperties to store the spillable map configs for the FG reader
[ https://issues.apache.org/jira/browse/HUDI-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7876: - Labels: pull-request-available (was: ) > Use TypedProperties to store the spillable map configs for the FG reader > > > Key: HUDI-7876 > URL: https://issues.apache.org/jira/browse/HUDI-7876 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > This takes up 4 params for the fg reader that can just be stored in the > TypedProperties that is already passed in. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7874) Fail to read 2-level structure Parquet
[ https://issues.apache.org/jira/browse/HUDI-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7874: - Labels: pull-request-available (was: ) > Fail to read 2-level structure Parquet > -- > > Key: HUDI-7874 > URL: https://issues.apache.org/jira/browse/HUDI-7874 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vitali Makarevich >Priority: Major > Labels: pull-request-available > > If I have {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} explicitly set - to be able to write nulls inside arrays (the only way), Hudi starts to write Parquet files with the following schema inside:
{code}
required group internal_list (LIST) {
  repeated group list {
    required int64 element;
  }
}
{code}
But if I had some files produced before setting {{{}"spark.hadoop.parquet.avro.write-old-list-structure", "false"{}}}, they have the following schema inside:
{code}
required group internal_list (LIST) {
  repeated int64 array;
}
{code}
And Hudi 0.14.x at least fails to read records from such a file, failing with the exception {{Caused by: java.lang.RuntimeException: Null-value for required field: }} > Even though the contents of the arrays are {{{}not null{}}} (in fact they cannot be null, since Avro requires {{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write {{{}null{}}}s). > h3. Expected behavior > Taken from Hudi 0.12.1 (not sure what exactly broke that): > # If I have a file with a 2-level structure and an update arrives (no matter whether there are nulls inside the array or not - both produce the same) with "spark.hadoop.parquet.avro.write-old-list-structure", "false" - overwrite it into the 3-level structure ({*}fails in 0.14.1{*}) > # If I have a 3-level structure with nulls and an update comes (no matter with nulls or without) - read and write correctly > A simple reproduction of the issue can be found here: > [https://github.com/VitoMakarevich/hudi-issue-014] > Highly likely, the problem appeared after Hudi made some changes, so values from the Hadoop conf started to propagate into the Reader instance (likely they were not propagated before). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7875) Remove tablePath from HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-7875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7875: - Labels: pull-request-available (was: ) > Remove tablePath from HoodieFileGroupReader > --- > > Key: HUDI-7875 > URL: https://issues.apache.org/jira/browse/HUDI-7875 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > tablePath is stored in the metaclient which is a param. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7873) Remove getStorage method from HoodieReaderContext
[ https://issues.apache.org/jira/browse/HUDI-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7873: - Labels: pull-request-available (was: ) > Remove getStorage method from HoodieReaderContext > - > > Key: HUDI-7873 > URL: https://issues.apache.org/jira/browse/HUDI-7873 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > All implementations of the method were the same, and it was only used by a test method because storage is passed as a param to the fg reader. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7872) Recreate Glue table on certain types of exceptions
[ https://issues.apache.org/jira/browse/HUDI-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7872: - Labels: pull-request-available (was: ) > Recreate Glue table on certain types of exceptions > -- > > Key: HUDI-7872 > URL: https://issues.apache.org/jira/browse/HUDI-7872 > Project: Apache Hudi > Issue Type: Task >Reporter: Vamsi Karnika >Priority: Major > Labels: pull-request-available > > If there are certain types of exceptions (schema changes, unable to add > partitions) re-create the Glue table so that the table continues to be > queryable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7871) Remove tableconfig from HoodieFilegroupReader params
[ https://issues.apache.org/jira/browse/HUDI-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7871: - Labels: pull-request-available (was: ) > Remove tableconfig from HoodieFilegroupReader params > > > Key: HUDI-7871 > URL: https://issues.apache.org/jira/browse/HUDI-7871 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > In prod usages, we just get the tableconfigs from the metaclient. The > constructor has too many params so getting rid of one will be useful. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7867) Data deduplication caused by drawback in the delete invalid files before commit
[ https://issues.apache.org/jira/browse/HUDI-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7867: - Labels: pull-request-available (was: ) > Data deduplication caused by drawback in the delete invalid files before > commit > --- > > Key: HUDI-7867 > URL: https://issues.apache.org/jira/browse/HUDI-7867 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: Jing Zhang >Priority: Major > Labels: pull-request-available > > Our user complained that after their daily run job, which writes to a Hudi COW table, finished, the downstream reading jobs found many duplicate records today. The daily run job has already been online for a long time, and this is the first time it produced such a wrong result. > He gives a detailed duplicated record as an example to help debug. The record appeared in 3 base files which belong to different file groups. > [screenshot: the same record present in 3 base files belonging to different file groups] > In today's writer job, the Spark application finished successfully. > In the driver log, I find those two files marked as invalid files to be deleted, and only one file as a valid file. 
> [screenshot: driver log marking the two files as invalid files to be deleted] > And in the clean stage task log, those two files are also marked to be deleted, and there is no exception in the task either. > [screenshot: clean stage task log]
[jira] [Updated] (HUDI-7838) Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and AbstractHoodieLogRecordReader
[ https://issues.apache.org/jira/browse/HUDI-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7838: - Labels: pull-request-available (was: ) > Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and > AbstractHoodieLogRecordReader > --- > > Key: HUDI-7838 > URL: https://issues.apache.org/jira/browse/HUDI-7838 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Jonathan Vexler >Assignee: Vova Kolmakov >Priority: Major > Labels: pull-request-available > > hoodie.schema.cache.enable should be used to decide if we want to use the > schema cache. Currently it is hardcoded to false. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7671) Make Hudi timeline backward compatible
[ https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7671: - Labels: compatibility pull-request-available (was: compatibility) > Make Hudi timeline backward compatible > -- > > Key: HUDI-7671 > URL: https://issues.apache.org/jira/browse/HUDI-7671 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: compatibility, pull-request-available > Fix For: 1.0.0 > > > Since release 1.x, the timeline metadata file name has changed to include the completion time, so we need to keep compatibility with 0.x branches/releases. > 0.x meta file name pattern: ${instant_time}.action[.state] > 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]. > In the 1.x release, while deciphering the Hudi instant from the metadata files, if there is no completion time, the file modification time is used as the completion time instead (see the sketch below). > The modification time follows the OCC concurrency control semantics if the files were not moved around. > Caution: if the table is a MOR table and the files got moved in history from an old folder to the current folder, the reader view may represent a wrong result set because the completion times are exactly the same for all the live instants. -- This message was sent by Atlassian Jira (v8.20.10#820010)
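A minimal sketch of the dual-pattern parsing and modification-time fallback described above. This is illustrative, not Hudi's actual parser; the regex assumes purely numeric instant and completion times:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InstantFileNameSketch {
  // 1.x: ${instant_time}_${completion_time}.action[.state]
  // 0.x: ${instant_time}.action[.state]  (no completion time)
  private static final Pattern NAME =
      Pattern.compile("(\\d+)(?:_(\\d+))?\\.([a-z]+)(?:\\.([a-z]+))?");

  static String completionTime(Path metaFile) throws IOException {
    Matcher m = NAME.matcher(metaFile.getFileName().toString());
    if (!m.matches()) {
      throw new IllegalArgumentException("Not an instant file: " + metaFile);
    }
    if (m.group(2) != null) {
      return m.group(2); // 1.x file name carries the completion time
    }
    // 0.x fallback: use the file modification time as the completion time.
    return String.valueOf(Files.getLastModifiedTime(metaFile).toMillis());
  }
}
{code}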
[jira] [Updated] (HUDI-7869) Ensure properties are copied when modifying schema
[ https://issues.apache.org/jira/browse/HUDI-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7869: - Labels: pull-request-available (was: ) > Ensure properties are copied when modifying schema > -- > > Key: HUDI-7869 > URL: https://issues.apache.org/jira/browse/HUDI-7869 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Properties are not always copied when we modify the schema, such as when removing fields. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7779: - Labels: pull-request-available (was: ) > Guarding archival to not archive unintended commits > --- > > Key: HUDI-7779 > URL: https://issues.apache.org/jira/browse/HUDI-7779 > Project: Apache Hudi > Issue Type: Bug > Components: archiving >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > Archiving commits from the active timeline could lead to data consistency issues on the rarest of occasions. We should come up with proper guards to ensure we do not make such an unintended archival. > > The major gap we want to guard against is: if someone disabled the cleaner, archival should account for data consistency issues and ensure it bails out. > We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for. > > a. Keeping aside replace commits, let's dive into specifics for regular commits and delta commits. > Say the user configured clean commits to 4 and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. > Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit. > Corner case to consider: > A savepoint was added to, say, t3 and later removed, and still the cleaner was never re-enabled. Even though archival would have been stopped at t3 (while the savepoint is present), once the savepoint is removed, if archival is executed, it could archive commit t3 - which means the file versions tracked at t3 are still not yet cleaned by the cleaner. > Reasoning: > We are good here wrt data consistency. Until the cleaner runs next time, these older file versions might be exposed to the end-user. But time travel queries are not intended for already cleaned up commits, and hence this is not an issue. None of snapshot, time travel, or incremental queries will run into issues as they are not supposed to poll for t3. > At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. Just that for the interim period, some older file versions might still be exposed to readers. > > b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean them up fully. But are there chances that this could go wrong? > Corner case to consider: let's add onto the above scenario, where t3 has a savepoint, and t4 is a replace commit which replaced file groups tracked in t3. > The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to retain will be pointing to t6. And say the savepoint for t3 is removed, but the cleaner was disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups. > > In other words, if we were to summarize the different scenarios: > i. replaced file group is never cleaned up. > - ECTR (earliest commit to retain) is less than this.rc and we are good. > ii. replaced file group is cleaned up. > - ECTR is > this.rc and it is good to archive. > iii. tricky: ECTR moved ahead compared to this.rc, but due to the savepoint, full clean-up did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we do not account for as of now (a minimal guard sketch follows below). > > We have 3 options to solve this. > Option A: > Let the savepoint deletion flow take care of cleaning up the files it is tracking. > Cons: > Savepoint's responsibility is not removing any data files, so from a single-responsibility rule, this may not be right. Also, this clean-up might need to do what a clean planner might actually be doing, i.e. build the file system view, understand if it's supposed to be cleaned u
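A minimal sketch of the extra guard covering scenario (iii) above, under the assumption that archival can ask whether a replace commit's replaced file groups were actually cleaned; all names here are hypothetical placeholders, not Hudi's actual archival code:
{code:java}
public class ArchivalGuardSketch {
  static boolean safeToArchive(String instantTs, String action,
                               String earliestCommitToRetain,
                               boolean replacedFileGroupsFullyCleaned) {
    // Base guard: never archive at or beyond the earliest commit to retain.
    if (instantTs.compareTo(earliestCommitToRetain) >= 0) {
      return false;
    }
    if (!"replacecommit".equals(action)) {
      return true;
    }
    // Scenario (iii): ECTR may have moved past this replacecommit while a
    // savepoint blocked the actual clean-up, so require confirmation that
    // the replaced file groups were really cleaned before archiving it.
    return replacedFileGroupsFullyCleaned;
  }
}
{code}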
[jira] [Updated] (HUDI-7847) Infer record merge mode during table upgrade
[ https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7847: - Labels: pull-request-available (was: ) > Infer record merge mode during table upgrade > > > Key: HUDI-7847 > URL: https://issues.apache.org/jira/browse/HUDI-7847 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Geser Dugarov >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Record merge mode is required to dictate the merging behavior in release 1.x, > playing the same role as the payload class config in the release 0.x. During > table upgrade, we need to infer the record merge mode based on the payload > class so it's correctly set. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping
[ https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7841: - Labels: pull-request-available (was: ) > RLI and secondary index should consider only pruned partitions for file > skipping > > > Key: HUDI-7841 > URL: https://issues.apache.org/jira/browse/HUDI-7841 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Even though RLI scans only matching files, it tries to get those candidate > files by iterating over all files from the file index. See - > [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47] > Instead, it can use `prunedPartitionsAndFileSlices` to consider only the pruned > partitions whenever there is a partition predicate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
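A minimal sketch of the proposed change, with simplified placeholder types rather than Hudi's actual Scala API: intersect the index's matched files with the files from partitions that survived partition pruning, instead of iterating over the whole file index.

{code:java}
// Hypothetical sketch -- types and names are placeholders, not Hudi's API.
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class PrunedCandidateFiles {
  static List<String> candidateFiles(
      // partition path -> file names, restricted to pruned partitions
      Map<String, List<String>> prunedPartitionsAndFileSlices,
      // file names the record-level index matched for the key predicate
      Set<String> fileNamesMatchedByIndex) {
    return prunedPartitionsAndFileSlices.values().stream()
        .flatMap(List::stream)
        // Keep only files that both survive pruning and match the index.
        .filter(fileNamesMatchedByIndex::contains)
        .collect(Collectors.toList());
  }
}
{code}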
[jira] [Updated] (HUDI-7855) Add ability to dynamically configure write parallelism for BULK_INSERT for HoodieStreamer
[ https://issues.apache.org/jira/browse/HUDI-7855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7855: - Labels: pull-request-available (was: ) > Add ability to dynamically configure write parallelism for BULK_INSERT for > HoodieStreamer > - > > Key: HUDI-7855 > URL: https://issues.apache.org/jira/browse/HUDI-7855 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Rajesh Mahindra >Assignee: Rajesh Mahindra >Priority: Major > Labels: pull-request-available > > Add ability to dynamically configure write parallelism for BULK_INSERT for > HoodieStreamer. Currently, BULK_INSERT parallelism is configured based on the > source parallelism, which may be too aggressive or too conservative depending on > other factors, e.g. the number of partitions written to. -- This message was sent by Atlassian Jira (v8.20.10#820010)
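One way such dynamic configuration could work, sketched under assumptions (this is not HoodieStreamer's actual logic; the sizing inputs and bounds are hypothetical): derive the parallelism from the estimated input size rather than inheriting the source parallelism.

{code:java}
// Illustrative sketch only -- derives a BULK_INSERT write parallelism from
// estimated input bytes, clamped to configured bounds.
class BulkInsertParallelism {
  static int derive(long estimatedInputBytes, long targetBytesPerTask,
                    int minParallelism, int maxParallelism) {
    // One task per targetBytesPerTask of input, with sane lower/upper bounds.
    long tasks = Math.max(1, estimatedInputBytes / Math.max(1, targetBytesPerTask));
    return (int) Math.min(maxParallelism, Math.max(minParallelism, tasks));
  }
}
{code}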
[jira] [Updated] (HUDI-7854) Bump AWS SDK v2 version to 2.25.69
[ https://issues.apache.org/jira/browse/HUDI-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7854: - Labels: pull-request-available (was: ) > Bump AWS SDK v2 version to 2.25.69 > -- > > Key: HUDI-7854 > URL: https://issues.apache.org/jira/browse/HUDI-7854 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > The currently used version of AWS SDK v2 is 2.18.40, which is 1.5 years old. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7853) Fix missing serDe properties post migration from hiveSync to glueSync
[ https://issues.apache.org/jira/browse/HUDI-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7853: - Labels: pull-request-available (was: ) > Fix missing serDe properties post migration from hiveSync to glueSync > - > > Key: HUDI-7853 > URL: https://issues.apache.org/jira/browse/HUDI-7853 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Prathit Malik >Assignee: Prathit Malik >Priority: Major > Labels: pull-request-available > > More info : [https://github.com/apache/hudi/issues/11397] > > After migration to 0.13.1, the Hudi table path is missing from the serde properties, > so reading the table from Spark throws the error below: > - org.apache.hudi.exception.HoodieException: 'path' or 'Key: > 'hoodie.datasource.read.paths' , default: null description: Comma separated > list of file paths to read within a Hudi table. since version: version is not > defined deprecated after: version is not defined)' or both must be specified. -- This message was sent by Atlassian Jira (v8.20.10#820010)
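As the error message itself suggests, a read can sidestep the missing serde property by passing the table path explicitly. A minimal sketch (the base path below is a placeholder, and this is a workaround, not the fix the ticket implements):

{code:java}
// Workaround sketch: load the Hudi table by explicit path so Spark does not
// depend on the 'path' serde property being present in the catalog.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class ReadHudiTable {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-read").getOrCreate();
    String basePath = "s3://bucket/path/to/hudi_table"; // placeholder path
    Dataset<Row> df = spark.read().format("hudi").load(basePath);
    df.show();
  }
}
{code}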