[jira] [Updated] (HUDI-7993) Support pruning and skipping with meta fields

2024-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7993:
-
Labels: pull-request-available  (was: )

> Support pruning and skipping with meta fields
> -
>
> Key: HUDI-7993
> URL: https://issues.apache.org/jira/browse/HUDI-7993
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8015) [Glue] Fix Glue Meta Sync Failure on base path change

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8015:
-
Labels: pull-request-available  (was: )

> [Glue] Fix Glue Meta Sync Failure on base path change
> -
>
> Key: HUDI-8015
> URL: https://issues.apache.org/jira/browse/HUDI-8015
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8014) Fix Exception "FileID of partition path xxx=xx does not exist"

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8014:
-
Labels: pull-request-available  (was: )

> Fix Exception "FileID of partition path xxx=xx does not exist"
> --
>
> Key: HUDI-8014
> URL: https://issues.apache.org/jira/browse/HUDI-8014
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: Vova Kolmakov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>
>  [https://github.com/apache/hudi/issues/11202]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8012) Update checkstyle.xml based on the new release

2024-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8012:
-
Labels: pull-request-available  (was: )

> Update checkstyle.xml based on the new release
> --
>
> Key: HUDI-8012
> URL: https://issues.apache.org/jira/browse/HUDI-8012
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>
> checkstyle.xml works only with the older checkstyle release, 8.20. We need to 
> make it work with recent checkstyle releases in IntelliJ (e.g., 10.12.5); see 
> the following error.
> {code:java}
> com.puppycrawl.tools.checkstyle.api.CheckstyleException: cannot initialize 
> module TreeWalker - TreeWalker is not allowed as a parent of LineLength 
> Please review 'Parent Module' section for this Check in web documentation if 
> Check is standard.
>     at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:486)
>     at 
> com.puppycrawl.tools.checkstyle.AbstractAutomaticBean.configure(AbstractAutomaticBean.java:207)
>     at 
> org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:61)
>     at 
> org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:26)
>     at 
> org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.executeCommand(CheckstyleActionsImpl.java:130)
>     at 
> org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:60)
>     at 
> org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:51)
>     at 
> org.infernus.idea.checkstyle.checker.CheckerFactoryWorker.run(CheckerFactoryWorker.java:46)
> Caused by: com.puppycrawl.tools.checkstyle.api.CheckstyleException: 
> TreeWalker is not allowed as a parent of LineLength Please review 'Parent 
> Module' section for this Check in web documentation if Check is standard.
>     at 
> com.puppycrawl.tools.checkstyle.TreeWalker.setupChild(TreeWalker.java:140)
>     at 
> com.puppycrawl.tools.checkstyle.AbstractAutomaticBean.configure(AbstractAutomaticBean.java:207)
>     at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:481)
>     ... 7 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8011) Allow schema.on.read with positional merging

2024-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8011:
-
Labels: pull-request-available  (was: )

> Allow schema.on.read with positional merging
> 
>
> Key: HUDI-8011
> URL: https://issues.apache.org/jira/browse/HUDI-8011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Internal schema doesn't have the positional column, so it will fail during 
> pruning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8005) New lock provider implementation

2024-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8005:
-
Labels: pull-request-available  (was: )

> New lock provider implementation
> 
>
> Key: HUDI-8005
> URL: https://issues.apache.org/jira/browse/HUDI-8005
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Davis Zhang
>Assignee: Davis Zhang
>Priority: Major
>  Labels: pull-request-available
>
> h2. Estimated effort: 2 days
>  
> *The new LP is DynamoDB-based only. Zookeeper is beyond the scope here.*
>  
> As of today, an LP like the DynamoDB one generates a per-table LP attribute 
> {{partition-key}}, which is used as the name of the lock that readers and 
> writers should grab on the DDB side. Its schema is {{<table name>-<chars of 
> the table uuid>}} to ensure uniqueness and a 1:1 mapping between the key and 
> the table. The table UUID is purely Onehouse-specific and is not accessible 
> from the Hudi writers' context; Hudi writers only have access to 
> HoodieWriterConfig and HoodieTableConfig. This means the partition key is 
> unknown to Hudi writers initiated by SQL.
>  
> The proposed solution is to change the schema of {{partition-key}} to 
> {{<table name>-<hash of table base path>}}. Since the table name and table 
> base path can be derived from writer configs by the Hudi writer, this 
> addresses the issue.
>  
> Properties of the partition key:
>  * *Uniqueness*: The lock key must be unique for each resource you want to 
> lock. This ensures that different resources are independently locked.
>  * *Meaningful Naming*: Use meaningful names for lock keys to make it clear 
> what resource is being locked. This is particularly useful for debugging and 
> maintenance.
>  * *DynamoDB Partition Key Limits*: DynamoDB has limits on the size of 
> partition keys. The maximum length for a partition key is 2048 bytes when 
> using UTF-8 encoding. Ensure your lock keys do not exceed this limit.
>  ** As of today, Hudi does not enforce a length limit on the table name. The 
> follow-up task is tracked here. It is beyond M1 scope.
>  
> For now, if the newly generated partition key is more than 2048 bytes, *we 
> will simply truncate the table name* to ensure the hash part can fit:
> {{<truncated table name>-<hash of table base path>}}
>  
> h3. Hash function
> We can use any mainstream non-cryptographic hash library like MurmurHash or 
> FarmHash.
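>  
> A minimal sketch of the proposed key generation, assuming the 2048-byte limit 
> and a stand-in hash ({{String.hashCode}} here; a real implementation would 
> use MurmurHash or FarmHash). The name {{lockPartitionKey}} is illustrative, 
> not the actual implementation:
> {code:java}
> import java.nio.charset.StandardCharsets;
> 
> public class LockKeySketch {
>   static String lockPartitionKey(String tableName, String basePath) {
>     final int maxBytes = 2048; // DynamoDB partition key limit in UTF-8 bytes
>     String suffix = "-" + Integer.toHexString(basePath.hashCode());
>     int budget = maxBytes - suffix.getBytes(StandardCharsets.UTF_8).length;
>     // Truncate the table name so the hash part always fits.
>     // (Char-based cut; multi-byte names would need a byte-accurate cut.)
>     String name = tableName.length() > budget ? tableName.substring(0, budget) : tableName;
>     return name + suffix;
>   }
> 
>   public static void main(String[] args) {
>     System.out.println(lockPartitionKey("my_table", "s3://bucket/path/my_table"));
>   }
> }
> {code}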



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8009) Optimize the code of HoodieTable#getPendingCommitTimeline

2024-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8009:
-
Labels: pull-request-available  (was: )

> Optimize the code of HoodieTable#getPendingCommitTimeline
> -
>
> Key: HUDI-8009
> URL: https://issues.apache.org/jira/browse/HUDI-8009
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4625) Clean up KafkaOffsetGen

2024-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4625:
-
Labels: pull-request-available  (was: )

> Clean up KafkaOffsetGen
> ---
>
> Key: HUDI-4625
> URL: https://issues.apache.org/jira/browse/HUDI-4625
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Alexey Kudinkin
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> There are a few issues w/in KafkaOffsetGen that we should follow-up on 
> annotated w/ corresponding TODOs:
>  # Using proper retrying client (instead of using sleeps for coordination)
>  # Cleaning up incorrect assertions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8008) Resolve Proto Schemas returned from Confluent registry

2024-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8008:
-
Labels: pull-request-available  (was: )

> Resolve Proto Schemas returned from Confluent registry
> --
>
> Key: HUDI-8008
> URL: https://issues.apache.org/jira/browse/HUDI-8008
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> The Confluent Schema Registry can return a proto schema with references to 
> other proto schemas. These references need to be resolved. The registry SDK 
> will handle this automatically for us so we can update to use that instead of 
> our own http client setup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8007) ShowInvalidParquetProcedure support delete by parameter

2024-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8007:
-
Labels: pull-request-available  (was: )

> ShowInvalidParquetProcedure support delete by parameter
> ---
>
> Key: HUDI-8007
> URL: https://issues.apache.org/jira/browse/HUDI-8007
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8006) glue sync may not update columns

2024-07-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8006:
-
Labels: pull-request-available  (was: )

> glue sync may not update columns 
> -
>
> Key: HUDI-8006
> URL: https://issues.apache.org/jira/browse/HUDI-8006
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> Due to async calls not completing in time, glue sync may consider the table 
> schema out of date in a sequence such as:
>  # promote a type
>  # add a table property
> Then step 2 retrieves the table schema before step 1 completes, and the 
> changes from step 1 are discarded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8002) Add Flink 1.15 and 1.14 bundle validation

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8002:
-
Labels: pull-request-available  (was: )

> Add Flink 1.15 and 1.14 bundle validation 
> --
>
> Key: HUDI-8002
> URL: https://issues.apache.org/jira/browse/HUDI-8002
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Jonathan Vexler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Removed in HUDI-7999 but might need to add back if we are not discontinuing 
> support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7964) Partitions not created correctly with SQL when multiple partitions specified out of order

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7964:
-
Labels: pull-request-available spark-sql  (was: spark-sql)

> Partitions not created correctly with SQL when multiple partitions specified 
> out of order
> -
>
> Key: HUDI-7964
> URL: https://issues.apache.org/jira/browse/HUDI-7964
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available, spark-sql
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-07-06 at 11.34.17 AM.png, Screenshot 
> 2024-07-11 at 5.43.41 PM.png
>
>
> When multiple partitions are specified out of order (as compared to the order 
> of fields in the create table command), the partitioning on storage is 
> incorrect. Test script (notice that create table or insert into command has 
> city and then state, while the partitioned by clause has state first and then 
> city):
> {code:java}
> DROP TABLE IF EXISTS hudi_table_mlp;
> CREATE TABLE hudi_table_mlp (
>   ts BIGINT,
>   id STRING,
>   rider STRING,
>   driver STRING,
>   fare DOUBLE,
>   city STRING,
>   state STRING) 
> USING HUDI options(
>   primaryKey ='id',
>   preCombineField = 'ts')
> PARTITIONED BY (state, city)location 'file:///tmp/hudi_table_mlp';
> INSERT INTO hudi_table_mlp VALUES 
> (1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'sunnyvale','california');
> INSERT INTO hudi_table_mlp VALUES 
> (1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'austin','texas');
> INSERT INTO hudi_table_mlp VALUES 
> (1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'houston','texas');
>  {code}
> This creates partitions as follows (note that the city and state values are 
> swapped):
> !Screenshot 2024-07-11 at 5.43.41 PM.png|width=737,height=335!
> Now, if I query with the state='texas' filter, there are no results:
> {code:java}
> spark-sql> select * from hudi_table_mlp where state='texas'; -- no results --
> Time taken: 0.356 seconds {code}
> I have tested this with master, 0.15.0 and 0.14.1, so it's not a recent 
> regression.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8003) Add overwrite payload for hive for record reader

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8003:
-
Labels: pull-request-available  (was: )

> Add overwrite payload for hive for record reader
> 
>
> Key: HUDI-8003
> URL: https://issues.apache.org/jira/browse/HUDI-8003
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> The overwrite payload is missing from Hive, so if the overwrite merger 
> strategy is chosen, it currently throws an exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8001) Insert overwrite failed due to missing 'path' property when using Spark 3.5.1 and Hudi 1.0.0

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-8001:
-
Labels: pull-request-available  (was: )

> Insert overwrite failed due to missing 'path' property when using Spark 3.5.1 
> and Hudi 1.0.0
> 
>
> Key: HUDI-8001
> URL: https://issues.apache.org/jira/browse/HUDI-8001
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> The issue with Spark 3.5.1 arises because the 
> {{InsertIntoHoodieTableCommand}} chain calls the initialization of the 
> {{HoodieFileIndex}} class. For v1 tables, the path is stored in 
> {{{}CatalogTable#CatalogStorageFormat#storageProperties{}}}, but not in 
> {{{}CatalogTable#properties{}}}. When Spark reloads the table, it removes the 
> path key from {{{}CatalogTable#CatalogStorageFormat#storageProperties{}}}. 
> Consequently, {{InsertIntoHoodieTableCommand}} in Hudi cannot retrieve the 
> path from either {{CatalogTable#CatalogStorageFormat#storageProperties}} or 
> {{CatalogTable#properties}} during {{{}deduceOverwriteConfig{}}}. This 
> absence of the path key in {{combinedOpts}} leads to an error when 
> initializing {{{}HoodieFileIndex{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7999) Disable ci testing for spark versions less than 3.3

2024-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7999:
-
Labels: pull-request-available  (was: )

> Disable ci testing for spark versions less than 3.3
> ---
>
> Key: HUDI-7999
> URL: https://issues.apache.org/jira/browse/HUDI-7999
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> We will be removing support for spark 2.4, 3.0, 3.1, 3.2. Before we remove 
> them we will transition our ci to run all the tests on higher versions of 
> spark. To prevent ci failures during this transition, we will stop the ci 
> from running tests on the versions to be discontinued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7998) Failed to insert overwrite hudi table when defining partition column with int type

2024-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7998:
-
Labels: pull-request-available  (was: )

> Failed to insert overwrite hudi table when defining partition column with int 
> type
> --
>
> Key: HUDI-7998
> URL: https://issues.apache.org/jira/browse/HUDI-7998
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/11623]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7997) fix hivesync to support type promotion correctly

2024-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7997:
-
Labels: pull-request-available  (was: )

> fix hivesync to support type promotion correctly
> 
>
> Key: HUDI-7997
> URL: https://issues.apache.org/jira/browse/HUDI-7997
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> see https://github.com/apache/hudi/issues/11599



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7919) Make integration tests run on Spark 3.5

2024-07-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7919:
-
Labels: pull-request-available  (was: )

> Make integration tests run on Spark 3.5
> ---
>
> Key: HUDI-7919
> URL: https://issues.apache.org/jira/browse/HUDI-7919
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The job "[integration-tests (spark2.4, 
> spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)|https://github.com/apache/hudi/actions/runs/9636476688/job/26574480698#logs]"
>  in the GitHub Java CI should run on Spark 3.5 after we can remove Spark 2 
> support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7996) Store partition type with partition fields in table configs

2024-07-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7996:
-
Labels: pull-request-available  (was: )

> Store partition type with partition fields in table configs
> ---
>
> Key: HUDI-7996
> URL: https://issues.apache.org/jira/browse/HUDI-7996
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7995) support decoding number(int, long, double) to fixed field when using JsonKafkaSource

2024-07-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7995:
-
Labels: pull-request-available  (was: )

> support decoding number(int, long, double) to fixed field when using 
> JsonKafkaSource
> 
>
> Key: HUDI-7995
> URL: https://issues.apache.org/jira/browse/HUDI-7995
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Kong Wei
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7934) RocksDBDAO prefixDelete function doesn't delete the last entry

2024-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7934:
-
Labels: pull-request-available  (was: )

> RocksDBDAO prefixDelete function doesn't delete the last entry
> --
>
> Key: HUDI-7934
> URL: https://issues.apache.org/jira/browse/HUDI-7934
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vova Kolmakov
>Assignee: Vova Kolmakov
>Priority: Minor
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/11075]
> {{getRocksDB().delete(lastEntry.getBytes());}}
> should be changed to
> {{getRocksDB().delete(managedHandlesMap.get(columnFamilyName), lastEntry.getBytes());}}
> so that the last entry is deleted from the scanned column family rather than 
> the default one. And the unit test (TestRocksDBDAO) must be fixed accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7989) Fix secondary index updates with other indexes

2024-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7989:
-
Labels: pull-request-available  (was: )

> Fix secondary index updates with other indexes
> --
>
> Key: HUDI-7989
> URL: https://issues.apache.org/jira/browse/HUDI-7989
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7988) ListingBasedRollbackStrategy support logcompact

2024-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7988:
-
Labels: pull-request-available  (was: )

> ListingBasedRollbackStrategy support logcompact
> ---
>
> Key: HUDI-7988
> URL: https://issues.apache.org/jira/browse/HUDI-7988
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> 1. [https://github.com/apache/hudi/issues/11589] As this issue describes, 
> log compaction rollback is not supported when markers are not used, and it 
> will throw 
> {{org.apache.hudi.exception.HoodieRollbackException: Unknown listing type, 
> during rollback of}}
> 2. Also, if the instant is complete, the log compaction file cannot be 
> deleted either, because it is seen as a `compact action` and only the base 
> file is deleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7986) Make precombine field optional with Dedup feature for Mutable Streams

2024-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7986:
-
Labels: pull-request-available  (was: )

> Make precombine field optional with Dedup feature for Mutable Streams
> -
>
> Key: HUDI-7986
> URL: https://issues.apache.org/jira/browse/HUDI-7986
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sivaguru Kannan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7985:
-
Labels: pull-request-available  (was: )

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use a space instead of {{T}} as the 
> separator (other parts are the same). References indicate that ISO 8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse 
> timestamps like {{2024-05-13T23:53:36.004Z}} , already supported in 
> {{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
> with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
> space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a 
> simple twist of the formatter, it can be easily supported.
> My take is that we should change the formatter of the timestamp logical types 
> to support a zone offset and the space character as the separator (which is 
> backwards compatible), instead of introducing a new format config (assuming 
> that common use cases just have the space character as the variant). 
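>  
> A small sketch of that "simple twist": one {{DateTimeFormatter}} built to 
> accept either {{T}} or a space as the separator, plus a zone offset (the 
> class and field names are illustrative):
> {code:java}
> import java.time.OffsetDateTime;
> import java.time.format.DateTimeFormatter;
> import java.time.format.DateTimeFormatterBuilder;
> 
> public class FlexibleTimestampParser {
>   // ISO date, then an optional 'T' or an optional space, then ISO time + offset.
>   private static final DateTimeFormatter FLEXIBLE_OFFSET_DATE_TIME =
>       new DateTimeFormatterBuilder()
>           .append(DateTimeFormatter.ISO_LOCAL_DATE)
>           .appendPattern("['T'][' ']")
>           .append(DateTimeFormatter.ISO_LOCAL_TIME)
>           .appendOffsetId()
>           .toFormatter();
> 
>   public static void main(String[] args) {
>     for (String ts : new String[] {
>         "2024-05-13T23:53:36.004Z",
>         "2011-12-03T10:15:30+01:00",
>         "2011-12-03 10:15:30+01:00",
>         "2024-06-03 13:42:34.951+00:00"}) {
>       System.out.println(OffsetDateTime.parse(ts, FLEXIBLE_OFFSET_DATE_TIME).toInstant());
>     }
>   }
> }
> {code}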



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7938) Missed HoodieSparkKryoRegistrar in Hadoop config by default

2024-07-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7938:
-
Labels: pull-request-available  (was: )

> Missed HoodieSparkKryoRegistrar in Hadoop config by default
> ---
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
>
> HUDI-7567 added schema evolution to the filegroup reader (#10957),
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
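>  
> For reference, a hedged workaround sketch using the quickstart Kryo settings 
> (the registrar is exactly what the report says is missing by default); the 
> class name {{HudiReadExample}} is illustrative:
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> public class HudiReadExample {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("hudi-read")
>         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>         .config("spark.kryo.registrator", "org.apache.hudi.HoodieSparkKryoRegistrar")
>         .getOrCreate();
>     // The failing pattern from the report: load and collect a Hudi table.
>     spark.read().format("org.apache.hudi").load(args[0]).collect();
>   }
> }
> {code}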



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7980:
-
Labels: pull-request-available  (was: )

> Optimize the configuration content when performing clustering with row writer
> -
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the row writer defaults to snapshot reads for all tables. However, 
> this method is relatively inefficient for MOR (Merge on Read) tables when 
> there are no logs. Therefore, we should optimize this part of the 
> configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7976:
-
Labels: pull-request-available  (was: )

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
> ---
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
>
> In the bugfix for HUDI-7955, the wrong class for invoking {{getTimestamp}} 
> was used.
>  # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp
>  # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>  
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>  
> Submitting a bugfix to fix this bugfix... 
> Log levels for the exception block are also changed to warn so errors will 
> be printed out.
> On top of that, we have simplified the {{getMillis}} shim to remove the 
> method that was added in HUDI-7955 to standardise it with how {{getDays}} is 
> written.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7979) Fix out of the box defaults with spillable memory configs

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7979:
-
Labels: pull-request-available  (was: )

> Fix out of the box defaults with spillable memory configs 
> --
>
> Key: HUDI-7979
> URL: https://issues.apache.org/jira/browse/HUDI-7979
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Looks like we are very conservative w.r.t. the memory configs used for the 
> spillable map based FSV. 
>  
> For example, we are only allocating 15 MB out of the box to file groups when 
> using the spillable map based FSV.
> public long getMaxMemoryForFileGroupMap() {
>   long totalMemory = getLong(SPILLABLE_MEMORY);
>   return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
> }
>  
> SPILLABLE_MEMORY default is 100 MB.
> getMaxMemoryForPendingCompaction = 80% of 100 MB.
> getMaxMemoryForBootstrapBaseFile = 5% of 100 MB.
> So, overall, out of the box we are allocating only 15 MB for 
> getMaxMemoryForFileGroupMap.
> ref: 
> [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
> Wondering whether we even need 80% for the pending compaction tracker in our 
> FSV. I am thinking of making it 15%, so that we can give more memory to 
> actual file groups. We may not have a lot of pending compactions for a given 
> table. 
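>  
> For reference, a toy sketch of the budget arithmetic under the current 
> defaults described above:
> {code:java}
> public class SpillableBudgetSketch {
>   public static void main(String[] args) {
>     long total = 100L * 1024 * 1024;                 // SPILLABLE_MEMORY default: 100 MB
>     long pendingCompaction = (long) (total * 0.80);  // 80 MB
>     long bootstrapBaseFile = (long) (total * 0.05);  // 5 MB
>     long fileGroupMap = total - pendingCompaction - bootstrapBaseFile;
>     System.out.println(fileGroupMap / (1024 * 1024) + " MB left for file groups"); // 15 MB
>   }
> }
> {code}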



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7978:
-
Labels: pull-request-available  (was: )

> Update docs for older versions to state that partitions should be ordered 
> when creating multiple partitions
> ---
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7977) Improve bucket index partitioner

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7977:
-
Labels: pull-request-available  (was: )

> Improve bucket index partitioner
> ---
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm so that data is 
> evenly distributed.
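>  
> A minimal sketch of one way to do that (illustrative, not the actual 
> {{BucketIndexUtil}} change): derive the task index from both the partition 
> path and the bucket id instead of the bucket id alone, so the same bucket of 
> different partitions lands on different tasks:
> {code:java}
> public class BucketTaskAssignSketch {
>   static int taskFor(String partitionPath, int bucketId, int parallelism) {
>     // Offsetting by a hash of the partition path spreads
>     // (partitionPath, bucketId) pairs across all tasks.
>     int offset = Math.floorMod(partitionPath.hashCode(), parallelism);
>     return Math.floorMod(offset + bucketId, parallelism);
>   }
> }
> {code}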



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7975) Transfer extra metadata to new commits when new data is not ingested, to trigger table services on the dataset

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7975:
-
Labels: pull-request-available  (was: )

> Transfer extra metadata to new commits when new data is not ingested, to 
> trigger table services on the dataset
> ---
>
> Key: HUDI-7975
> URL: https://issues.apache.org/jira/browse/HUDI-7975
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7974) Create empty clean commit at a cadence and make it configurable

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7974:
-
Labels: pull-request-available  (was: )

> Create empty clean commit at a cadence and make it configurable
> ---
>
> Key: HUDI-7974
> URL: https://issues.apache.org/jira/browse/HUDI-7974
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is also stored in table config

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7970:
-
Labels: pull-request-available  (was: )

> Add support to read partition fields when partition type is also stored in 
> table config
> ---
>
> Key: HUDI-7970
> URL: https://issues.apache.org/jira/browse/HUDI-7970
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` 
> to also store partition type. This PR aims to make sure that the getter and 
> other functions accessing this field remain consistent in behaviour with the 
> new value type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7969) Fix data loss caused by concurrent write and clean

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7969:
-
Labels: pull-request-available  (was: )

> Fix data loss caused by concurrent write and clean
> --
>
> Key: HUDI-7969
> URL: https://issues.apache.org/jira/browse/HUDI-7969
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Xinyu Zou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7692) Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7692:
-
Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Move MDT partition type code in HoodieMetadataPayload to MetadataPartitionType
> --
>
> Key: HUDI-7692
> URL: https://issues.apache.org/jira/browse/HUDI-7692
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> https://github.com/apache/hudi/pull/10352#discussion_r1584137942



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7025) Merge Index and Functional Index Config

2024-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7025:
-
Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Merge Index and Functional Index Config
> ---
>
> Key: HUDI-7025
> URL: https://issues.apache.org/jira/browse/HUDI-7025
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Minor
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> There is an {{INDEX}} sub-group name in `ConfigGroups`. Functional index 
> configs can be consolidated within it.
>  
> https://github.com/apache/hudi/pull/9872#discussion_r1377115549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries

2024-07-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7967:
-
Labels: RobustWrites pull-request-available  (was: RobustWrites)

> Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7968:
-
Labels: RobustWrites pull-request-available  (was: RobustWrites)

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7962) Add show create table command

2024-07-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7962:
-
Labels: pull-request-available  (was: )

> Add show create table command
> -
>
> Key: HUDI-7962
> URL: https://issues.apache.org/jira/browse/HUDI-7962
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7966) NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference

2024-07-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7966:
-
Labels: pull-request-available  (was: )

> NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference
> ---
>
> Key: HUDI-7966
> URL: https://issues.apache.org/jira/browse/HUDI-7966
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Running 
> [long-running|https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/spark-long-running-non-partitioned.yaml]
>  deltastreamer with the following properties: 
> [https://github.com/apache/hudi/blob/dbfe8b23c0b4f160b26379053873cfc2a46acef4/docker/demo/config/test-suite/test-nonpartitioned.properties]
> The job throws NPE during validation phase:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 69.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 69.0 (TID 345) (10.0.103.207 executor 1): 
> java.lang.NullPointerException  at 
> org.apache.avro.JsonProperties$2$1$1.<init>(JsonProperties.java:175)  at 
> org.apache.avro.JsonProperties$2$1.iterator(JsonProperties.java:174)  at 
> org.apache.avro.JsonProperties.getObjectProps(JsonProperties.java:305)  at 
> org.apache.hudi.avro.AvroSchemaUtils.createNewSchemaFromFieldsWithReference(AvroSchemaUtils.java:306)
>   at 
> org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaBase(AvroSchemaUtils.java:293)
>   at 
> org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested(AvroSchemaUtils.java:245)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.generateRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:146)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.prepareRequiredSchema(HoodieFileGroupReaderSchemaHandler.java:150)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReaderSchemaHandler.<init>(HoodieFileGroupReaderSchemaHandler.java:84)
>   at 
> org.apache.hudi.common.table.read.HoodieFileGroupReader.<init>(HoodieFileGroupReader.java:113)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:170)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)  
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
>  at org.apache.spark.scheduler.Task.run(Task.scala:136)  at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}
> It seems like the code assumes that all schemas must have properties, which 
> may not necessarily be true.
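>  
> A defensive sketch of what a guard could look like (an assumption, not the 
> actual patch: tolerate schemas whose JSON properties cannot be enumerated 
> when copying them onto a new schema):
> {code:java}
> import java.util.Collections;
> import java.util.Map;
> import org.apache.avro.Schema;
> 
> public class SchemaPropsSketch {
>   // Copies properties from source to target, treating a schema with no
>   // usable properties as having an empty property map.
>   static void copyProps(Schema source, Schema target) {
>     Map<String, Object> props;
>     try {
>       props = source.getObjectProps();
>     } catch (NullPointerException npe) { // observed in the stack trace above
>       props = Collections.emptyMap();
>     }
>     props.forEach(target::addProp);
>   }
> }
> {code}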



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7965) Clean up SchemaTestUtil code

2024-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7965:
-
Labels: pull-request-available  (was: )

> Clean up SchemaTestUtil code
> 
>
> Key: HUDI-7965
> URL: https://issues.apache.org/jira/browse/HUDI-7965
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7963) Avoid generating RLI records when disabled w/ MDT

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7963:
-
Labels: pull-request-available  (was: )

> Avoid generating RLI records when disabled w/ MDT
> -
>
> Key: HUDI-7963
> URL: https://issues.apache.org/jira/browse/HUDI-7963
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7961:
-
Labels: pull-request-available  (was: )

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7958) Create partition stats index for all columns when no columns specified

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7958:
-
Labels: pull-request-available  (was: )

> Create partition stats index for all columns when no columns specified
> --
>
> Key: HUDI-7958
> URL: https://issues.apache.org/jira/browse/HUDI-7958
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Just like the column stats index, we can create the partition stats index for 
> all columns if no columns are configured by the user.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7957) data skew when writing with bulk_insert + bucket_index enabled

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7957:
-
Labels: pull-request-available  (was: )

> data skew when writing with bulk_insert + bucket_index enabled
> --
>
> Key: HUDI-7957
> URL: https://issues.apache.org/jira/browse/HUDI-7957
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11565] describes, when using 
> row-writing bulk insert on a bucket-indexed table, data will skew because of 
> the partitioner algorithm.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7955) Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and Hive2 discrepancies

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7955:
-
Labels: pull-request-available  (was: )

> Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and 
> Hive2 discrepancies
> -
>
> Key: HUDI-7955
> URL: https://issues.apache.org/jira/browse/HUDI-7955
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-07-05-18-11-33-420.png, 
> image-2024-07-05-18-13-28-135.png
>
>
> The invocation of *getPrimitiveJavaObject* returns a different implementation 
> of timestamp in Hive3 and Hive2. 
>  - Hive2: *java.sql.Timestamp*
>  - Hive3: *org.apache.hadoop.hive.common.type.Timestamp*
> Hudi common is compiled with Hive2, but Trino is using Hive3, causing a 
> discrepancy between compile time and runtime. The execution flow falls into 
> this section of the code when the trigger conditions listed below are met:
> 1. MOR table is used
> 2. User is querying the _rt table
> 3. User's table has a *TIMESTAMP* type and query requires it
> 4. Merge is required as record is present in both Parquet and Log file
> Error below will be thrown:
> {code:java}
> Query 20240704_075218_05052_yfmfc failed: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
> java.lang.NoSuchMethodError: 'java.sql.Timestamp 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:304)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:212)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:121)
>         at 
> org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:108)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:185)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:172)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:114)
>         at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:49)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:88)
>         at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>         at 
> io.trino.plugin.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:215)
>         at 
> io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:88)
>         at 
> io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120){code}
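>  
> A hedged sketch of a version-agnostic shim (assuming Hive3's {{Timestamp}} 
> exposes {{toEpochMilli()}}; the reflective call avoids binding to either 
> class at compile time, and the method name {{toMillis}} is illustrative):
> {code:java}
> public class TimestampShimSketch {
>   static long toMillis(Object ts) {
>     if (ts instanceof java.sql.Timestamp) {          // Hive2 runtime
>       return ((java.sql.Timestamp) ts).getTime();
>     }
>     try {                                            // Hive3 runtime
>       return (long) ts.getClass().getMethod("toEpochMilli").invoke(ts);
>     } catch (ReflectiveOperationException e) {
>       throw new IllegalArgumentException("Unsupported timestamp type: " + ts.getClass(), e);
>     }
>   }
> }
> {code}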
> h1. Hive3
> !image-2024-07-05-18-11-33-420.png|width=509,height=572!
> h1. Hive2
> !image-2024-07-05-18-13-28-135.png|width=507,height=501!
>  
> h1. How to reproduce
>  
>  
> {code:java}
> CREATE TABLE dev_hudi.hudi_7955__hive3_timestamp_issue (
>     id INT,
>     name STRING,
>     timestamp_col TIMESTAMP,
>     grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
>     primaryKey = 'id',
>     type = 'mor',
>     precombineField = 'id',
>     hoodie.index.type = 'BUCKET',
>     hoodie.index.bucket.engine = 'CONSISTENT_HASHING',
>     hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://path/to/hudi_tables/hudi_7955__hive3_timestamp_issue';
> -- 5 separate commits to trigger compaction
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (1, 'alex1', 
> now(), 'SG');
> -- No error here as no MERGE is required between Parquet + Log
> SELECT _hoodie_file_name, id, timestamp_col FROM 
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE 
> '%parquet%';
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (2, 'alex2', 
> now(), 'SG');
> {code}

[jira] [Updated] (HUDI-7954) Fix data skipping with secondary index when there are no log files

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7954:
-
Labels: pull-request-available  (was: )

> Fix data skipping with secondary index when there are no log files
> --
>
> Key: HUDI-7954
> URL: https://issues.apache.org/jira/browse/HUDI-7954
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> When there are no log files in the index, the lookup returns no secondary 
> keys or candidate files, because of a bug - `logRecordsMap` is empty in this 
> code and base file records are ignored - 
> [https://github.com/apache/hudi/blob/70f44efe298771fcef9d029820a9b431e1ff165c/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java#L970]
> The current tests for pruning assert that the filtered files count < the 
> total data files count. That is weak in the sense that it does not assert 
> that the filtered files count > 0, so the assertion passed even when the 
> filtered files count = 0. Ultimately, all files were getting scanned. We 
> should fix this behavior. A strengthened assertion is sketched below.
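> A hedged sketch of the strengthened assertion (JUnit 5; the count variables 
> are illustrative, not names from the actual tests):
> {code:java}
> // Require pruning in both directions: the lookup must return at least one
> // candidate file, and it must also prune at least one file.
> assertTrue(filteredFilesCount > 0,
>     "data skipping should return at least one candidate file");
> assertTrue(filteredFilesCount < totalDataFilesCount,
>     "data skipping should prune at least one file");
> {code}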



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7953) Improved the variable naming and formatting of HoodieActiveTimeline and HoodieIndex

2024-07-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7953:
-
Labels: pull-request-available  (was: )

> Improved the variable naming and formatting of HoodieActiveTimeline and 
> HoodieIndex
> ---
>
> Key: HUDI-7953
> URL: https://issues.apache.org/jira/browse/HUDI-7953
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6510) Java 17 compile time support

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6510:
-
Labels: pull-request-available  (was: )

> Java 17 compile time support
> 
>
> Key: HUDI-6510
> URL: https://issues.apache.org/jira/browse/HUDI-6510
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Certify Hudi with Java 17 compile time support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7929:
-
Labels: pull-request-available  (was: )

> Add Flink Hudi Example for K8s
> --
>
> Key: HUDI-7929
> URL: https://issues.apache.org/jira/browse/HUDI-7929
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7949) insert into hudi table with columns specified (reordered and not in table schema order) throws exception

2024-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7949:
-
Labels: pull-request-available  (was: )

> insert into hudi table with columns specified (reordered and not in table 
> schema order) throws exception
> ---
>
> Key: HUDI-7949
> URL: https://issues.apache.org/jira/browse/HUDI-7949
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11552



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7937:
-
Labels: pull-request-available  (was: )

> Fix handling of decimals in StreamSync and Clustering
> -
>
> Key: HUDI-7937
> URL: https://issues.apache.org/jira/browse/HUDI-7937
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> When decimals use a small precision, we need to write them in the legacy 
> format to ensure all Hudi components can read them back. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7951:
-
Labels: pull-request-available  (was: )

> Classes using avro causing conflict in hudi-aws-bundle
> --
>
> Key: HUDI-7951
> URL: https://issues.apache.org/jira/browse/HUDI-7951
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Hudi 0.15 added some Hudi classes with avro usages 
> (ParquetTableSchemaResolver in this case) and also made hudi-aws-bundle depend 
> on hudi-hadoop-common. hudi-aws-bundle does not relocate avro classes so as to 
> stay compatible with hudi-spark.
>  
> The issue would happen when using hudi-flink-bundle with hudi-aws-bundle. 
> hudi-flink-bundle has relocated avro classes and would cause class conflict:
> {code:java}
> java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
> org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
>  org.apache.hadoop.conf.Configuration)'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7950:
-
Labels: pull-request-available  (was: )

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0, 0.15.1
>
>
> We should unify the shading rule of roaring bitmap dependency in the root POM 
> for consistency among bundles.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7941) add show_file_status procedure

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7941:
-
Labels: pull-request-available  (was: )

> add show_file_status procedure
> --
>
> Key: HUDI-7941
> URL: https://issues.apache.org/jira/browse/HUDI-7941
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> When incrementally consuming a Hudi table or performing clustering or 
> compaction operations on it, it is often found that a certain file does not 
> exist, and finding out which operation deleted the file is very troublesome. 
> For this purpose, we provide a tool `show_file_status` to show whether a 
> specified file has been deleted and which action deleted it.
> usage:
> call show_file_status(table => '$tableName', partition => '$partition', file 
> => '$fileName')
> call show_file_status(table => '$tableName', file => '$fileName')
> output:
> 1) the file was deleted by the restore action
> +-------+-------+-----------------+--------+---------+
> |status |action |instant          |timeline|full_path|
> +-------+-------+-----------------+--------+---------+
> |deleted|restore|20240629225539880|active  |         |
> +-------+-------+-----------------+--------+---------+
> 2) the file has been deleted in other ways, such as hdfs dfs -rm
> +-------+------+-------+--------+---------+
> |status |action|instant|timeline|full_path|
> +-------+------+-------+--------+---------+
> |unknown|      |       |        |         |
> +-------+------+-------+--------+---------+
> 3) the file exists
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |status|action|instant|timeline|full_path                                                                                                                                       |
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |exist |      |       |active  |/Users/xx/xx/others/data/hudi-warehouse/source1/hudi_mor_append/sex=0/85ad0f44-22bf-4733-99bf-06382d6eacd5-0_0-130-89_20240629230123162.parquet|
> +------+------+-------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7948) RFC-80: Support column families for wide tables

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7948:
-
Labels: pull-request-available  (was: )

> RFC-80: Support column families for wide tables
> ---
>
> Key: HUDI-7948
> URL: https://issues.apache.org/jira/browse/HUDI-7948
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vova Kolmakov
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Write, discuss, approve RFC document in github



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7943) Resolve version conflict of fasterxml on spark3.2

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7943:
-
Labels: pull-request-available  (was: )

> Resolve version conflict of fasterxml on spark3.2 
> --
>
> Key: HUDI-7943
> URL: https://issues.apache.org/jira/browse/HUDI-7943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
> Environment: hudi0.14.1, Spark3.2
>Reporter: Jihwan Lee
>Priority: Major
>  Labels: pull-request-available
>
> When running a streaming read on Spark 3.2, an exception is raised requiring 
> the correct version of jackson-databind.
> Spark versions other than 3.2 seem to use the versions pinned by the Spark 
> dependencies.
>  
> version reference: https://github.com/apache/spark/blob/v3.2.3/pom.xml#L170
>  
> example code:
>  
> {code:java}
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.common.table.HoodieTableConfig._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.common.model.HoodieRecord
> import spark.implicits._
> val basePath = "hdfs:///tmp/trips_table"
> spark.readStream
>   .format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .option("hoodie.datasource.query.incremental.format", "cdc")
>   .load(basePath)
>   .writeStream
>   .format("console")
>   .option("checkpointLocation", "/tmp/trips_table_checkpoint")
>   .outputMode("append")
>   .start()
>   .awaitTermination()
> {code}
>  
>  
> error log:
>  
> {code:java}
> Caused by: java.lang.ExceptionInInitializerError: 
> com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.0 
> requires Jackson Databind version >= 2.10.0 and < 2.11.0
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieSourceOffset.(HoodieSourceOffset.scala:30)
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getLatestOffset(HoodieStreamSource.scala:127)
>   at 
> org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getOffset(HoodieStreamSource.scala:138)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:403)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:402)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:384)
>   at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:627)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:380)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:210)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
>   at 
> org.apache.spark.sql.execution.streaming

[jira] [Updated] (HUDI-7883) Ensure 1.x commit instants are readable w/ 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7883:
-
Labels: pull-request-available  (was: )

> Ensure 1.x commit instants are readable w/ 0.16.0 
> --
>
> Key: HUDI-7883
> URL: https://issues.apache.org/jira/browse/HUDI-7883
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>
> Ensure 1.x commit instants are readable w/ 0.16.0 reader.
>  
> We may need to migrate the HoodieInstant parsing logic to 0.16.0 in a 
> backwards-compatible manner, or it is already ported and we just need to 
> write tests and validate. 
> [https://github.com/apache/hudi/pull/9617] - contains some portion 
> (HoodieInstant changes and some method renames)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7945:
-
Labels: pull-request-available  (was: )

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(
>     HoodieColumnRangeMetadata<T> one, HoodieColumnRangeMetadata<T> another) {
>   ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>       "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>       null, one.getColumnName(), minValue, maxValue,
>       one.getNullCount() + another.getNullCount(),
>       one.getValueCount() + another.getValueCount(),
>       one.getTotalSize() + another.getTotalSize(),
>       one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> }
> {code}
> The null causes an NPE when loading the column stats per partition from the 
> PARTITION_STATS index.  Also, the current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name, and it does not work even if the file path does not contain 
> null (even storing a list of file names does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.Ho

[jira] [Updated] (HUDI-7940) Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7940:
-
Labels: pull-request-available  (was: )

> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table
> ---
>
> Key: HUDI-7940
> URL: https://issues.apache.org/jira/browse/HUDI-7940
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>
> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7882:
-
Labels: pull-request-available  (was: )

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 5. Tests 
> 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 6 Doc changes 
> 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7905:
-
Labels: pull-request-available  (was: )

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite, and delete 
> partition. Clustering should be a separate action for the requested and 
> inflight instants. This simplifies a few things, such as not needing to scan 
> replacecommit.requested to determine whether we are looking at a clustering 
> plan. It would also simplify the usage of the pending-clustering-related 
> APIs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7859) Rename instant files to be consistent with 0.x naming format

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7859:
-
Labels: pull-request-available  (was: )

> Rename instant files to be consistent with 0.x naming format
> 
>
> Key: HUDI-7859
> URL: https://issues.apache.org/jira/browse/HUDI-7859
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Needed for downgrade



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7915) Spark 4 support

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7915:
-
Labels: pull-request-available  (was: )

> Spark 4 support
> ---
>
> Key: HUDI-7915
> URL: https://issues.apache.org/jira/browse/HUDI-7915
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Spark 4.0.0-preview1 is out.  We should start integrating Hudi with Spark 4 
> and surface any issues early on.
> https://spark.apache.org/news/spark-4.0.0-preview1.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4822) Extract the baseFile and logFiles from HoodieDeltaWriteStat in the right way

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4822:
-
Labels: pull-request-available  (was: )

> Extract the baseFile and logFiles from HoodieDeltaWriteStat in the right way
> 
>
> Key: HUDI-4822
> URL: https://issues.apache.org/jira/browse/HUDI-4822
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> Currently, we can't get the `baseFile` and `logFiles` members from 
> `HoodieDeltaWriteStat` directly, because the related information is lost 
> after deserialization from the commit files. We need to improve this.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7903) Partition Stats Index not getting created with SQL

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7903:
-
Labels: pull-request-available  (was: )

> Partition Stats Index not getting created with SQL
> --
>
> Key: HUDI-7903
> URL: https://issues.apache.org/jira/browse/HUDI-7903
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> {code:java}
> spark.sql(
>   s"""
>  | create table $tableName using hudi
>  | partitioned by (dt)
>  | tblproperties(
>  |primaryKey = 'id',
>  |preCombineField = 'ts',
>  |'hoodie.metadata.index.partition.stats.enable' = 'true'
>  | )
>  | location '$tablePath'
>  | AS
>  | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
> cast('2021-05-06' as date) as dt
>""".stripMargin
> ) {code}
> Even when partition stats is enabled, the index is not created via SQL. It 
> works for the datasource path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7926) dataskipping failure mode should be strict in test

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7926:
-
Labels: pull-request-available  (was: )

> dataskipping failure mode should be strict in test
> --
>
> Key: HUDI-7926
> URL: https://issues.apache.org/jira/browse/HUDI-7926
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Critical
>  Labels: pull-request-available
>
> The data skipping failure mode should be strict in tests. If the default 
> fallback mode is used, the query UTs are meaningless.
> Bugs may have been introduced in other code paths but cannot be detected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7709) Class Cast Exception while reading the data using TimestampBasedKeyGenerator

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7709:
-
Labels: pull-request-available  (was: )

> Class Cast Exception while reading the data using TimestampBasedKeyGenerator
> 
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7924) Capture Latency and Failure Metrics For Hive Table recreation

2024-06-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7924:
-
Labels: pull-request-available  (was: )

> Capture Latency and Failure Metrics For Hive Table recreation
> -
>
> Key: HUDI-7924
> URL: https://issues.apache.org/jira/browse/HUDI-7924
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> As part of recreating the Glue and Hive table whenever the schema or 
> partition sync fails, we want to capture and push metrics related to latency 
> (time taken to recreate and sync the table) and failures (when recreating 
> the table fails):
>  * Push a latency metric to capture the time taken to recreate and sync the 
> table
>  * Push a failure metric if recreate and sync fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7922) Add Hudi CLI bundle for Scala 2.13

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7922:
-
Labels: pull-request-available  (was: )

> Add Hudi CLI bundle for Scala 2.13
> --
>
> Key: HUDI-7922
> URL: https://issues.apache.org/jira/browse/HUDI-7922
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Build of Hudi CLI bundle should succeed on Scala 2.13 and work on Spark 3.5 
> and Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7921:
-
Labels: pull-request-available  (was: )

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We see OOMs when Deltastreamer is running continuously for days on end. We 
> suspect some memory leaks when the metadata table is enabled. Let's try to 
> chase down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7911) Enable cdc log for MOR table

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7911:
-
Labels: pull-request-available  (was: )

> Enable cdc log for MOR table
> 
>
> Key: HUDI-7911
> URL: https://issues.apache.org/jira/browse/HUDI-7911
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7920) Make Spark 3.5 the default build profile for Spark integration

2024-06-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7920:
-
Labels: pull-request-available  (was: )

> Make Spark 3.5 the default build profile for Spark integration
> --
>
> Key: HUDI-7920
> URL: https://issues.apache.org/jira/browse/HUDI-7920
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, Spark 3.2 is the default build profile.  Given Spark 3.2 is no 
> longer actively maintained (the latest Spark 3.2.x release is from April 
> 2023), we should upgrade the default Spark build profile to Spark 3.5 to 
> maintain support for the latest Spark release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7914) Incorrect schema produced in DELETE_PARTITION replacecommit

2024-06-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7914:
-
Labels: pull-request-available  (was: )

> Incorrect schema produced in DELETE_PARTITION replacecommit
> ---
>
> Key: HUDI-7914
> URL: https://issues.apache.org/jira/browse/HUDI-7914
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vitali Makarevich
>Priority: Major
>  Labels: pull-request-available
>
> In the current scenario, delete_partitions produces a {{replacecommit}} with 
> internal fields - like {{{}_hoodie_file_name{}}} - while e.g. a normal 
> {{commit}} produces a schema without such fields.
> This leads to unexpected behavior when the {{replacecommit}} is the last one 
> on the timeline,
> e.g. [#10258|https://github.com/apache/hudi/issues/10258]
> [#10533|https://github.com/apache/hudi/issues/10533]
> and e.g. metadata sync, or any other potential write, will pick up the 
> incorrect schema - in the best case it will fail because fields are 
> duplicated, and in the worst case it can lead to data loss.
> The problem was introduced in [https://github.com/apache/hudi/pull/5610/files].
> For other operations like {{delete}}, the same approach is used as I use now. 
> A hedged sketch of stripping the meta fields follows.
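> A minimal sketch, assuming the fix strips Hudi meta fields from the schema 
> recorded in the replacecommit the same way regular commits do 
> (HoodieAvroUtils.removeMetadataFields is an existing Hudi utility; its use 
> here is illustrative, not the actual patch):
> {code:java}
> // Drop _hoodie_* meta fields before recording the writer schema in the
> // replacecommit metadata, so it matches what a normal commit records.
> Schema schemaWithoutMetaFields =
>     HoodieAvroUtils.removeMetadataFields(tableSchemaWithMetaFields);
> commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY,
>     schemaWithoutMetaFields.toString());
> {code}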



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7909) Add Comment to the FieldSchema returned by Aws Glue Client

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7909:
-
Labels: pull-request-available  (was: )

> Add Comment to the FieldSchema returned by Aws Glue Client 
> ---
>
> Key: HUDI-7909
> URL: https://issues.apache.org/jira/browse/HUDI-7909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> The implementation of getMetastoreFieldSchema in AwsGlueCatalogSyncClient 
> doesn't include the comment as part of the FieldSchema. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7906) improve the parallelism deduce in rdd write

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7906:
-
Labels: pull-request-available  (was: )

> improve the parallelism deduce in rdd write
> ---
>
> Key: HUDI-7906
> URL: https://issues.apache.org/jira/browse/HUDI-7906
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11274] and 
> [https://github.com/apache/hudi/pull/11463] describe, there are two problem 
> cases:
>  # if the RDD is an input RDD without a shuffle, the partition number can be 
> far too big or too small
>  # users cannot control it easily
>  ## in some cases users can change it by setting `spark.default.parallelism`
>  ## in some cases users cannot change it because it is hard-coded
>  ## in Spark, the better way is to let `spark.default.parallelism` or 
> `spark.sql.shuffle.partitions` control it; anything beyond that is an 
> advanced knob in Hudi. A possible deduction order is sketched below.
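> A hedged sketch of the deduction order argued for above (the method and 
> variable names are hypothetical, not Hudi APIs):
> {code:java}
> // Prefer an explicit Hudi setting, then Spark's own knobs, and only then
> // fall back to the incoming RDD's partitioning.
> static int deduceWriteParallelism(SparkSession spark, int hudiConfigured, int inputPartitions) {
>   if (hudiConfigured > 0) {
>     return hudiConfigured;
>   }
>   int shuffle = Integer.parseInt(spark.conf().get("spark.sql.shuffle.partitions", "0"));
>   if (shuffle > 0) {
>     return shuffle;
>   }
>   int dflt = Integer.parseInt(spark.conf().get("spark.default.parallelism", "0"));
>   if (dflt > 0) {
>     return dflt;
>   }
>   return inputPartitions;
> }
> {code}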



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7877:
-
Labels: pull-request-available  (was: )

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that can be used in the index lookup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7892) Building workload support set parallelism

2024-06-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7892:
-
Labels: pull-request-available  (was: )

> Building workload support set parallelism
> -
>
> Key: HUDI-7892
> URL: https://issues.apache.org/jira/browse/HUDI-7892
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>
> Support setting the parallelism when building the workload profile.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7891) Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type

2024-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7891:
-
Labels: pull-request-available  (was: )

> Fix HoodieActiveTimeline#deleteCompletedRollback missing check for Action type
> --
>
> Key: HUDI-7891
> URL: https://issues.apache.org/jira/browse/HUDI-7891
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7881) Handle table base path changes in meta syncs.

2024-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7881:
-
Labels: pull-request-available  (was: )

> Handle table base path changes in meta syncs.
> -
>
> Key: HUDI-7881
> URL: https://issues.apache.org/jira/browse/HUDI-7881
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, meta-sync
>Reporter: Vinish Reddy
>Assignee: Vinish Reddy
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7880) Support extraMetadata in Spark SQL Insert Into

2024-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7880:
-
Labels: pull-request-available  (was: )

> Support extraMetadata in Spark SQL Insert Into
> --
>
> Key: HUDI-7880
> URL: https://issues.apache.org/jira/browse/HUDI-7880
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Users want to implement checkpoints similar to those in Hudi DeltaStreamer. 
> DeltaStreamer implements this by saving values into the extraMetadata of a 
> commit file, with the key deltastreamer.checkpoint.key. We can achieve this 
> with the Spark client by configuring the parameter 
> `hoodie.datasource.write.commitmeta.key.prefix`, but in Spark SQL it is 
> restricted such that the prefix of any configuration parameter must be 
> `hoodie.`. A sketch of the datasource-path workaround follows.
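> A minimal sketch of the existing datasource-path workaround, assuming a 
> Dataset<Row> named df; the checkpoint value, table name, and path are made-up 
> examples:
> {code:java}
> // Options whose keys start with the configured prefix are copied into the
> // extraMetadata of the resulting commit.
> df.write().format("hudi")
>     .option("hoodie.datasource.write.commitmeta.key.prefix", "deltastreamer.checkpoint")
>     .option("deltastreamer.checkpoint.key", "ckpt-001")  // hypothetical checkpoint value
>     .option("hoodie.table.name", "trips_table")
>     .mode("append")
>     .save("/path/to/trips_table");
> {code}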



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7879) Optimize the redundant creation of HoodieTable in DataSourceInternalWriterHelper and the unnecessary parameters in createTable within BaseHoodieWriteClient.

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7879:
-
Labels: pull-request-available  (was: )

> Optimize the redundant creation of HoodieTable in 
> DataSourceInternalWriterHelper and the unnecessary parameters in createTable 
> within BaseHoodieWriteClient.
> 
>
> Key: HUDI-7879
> URL: https://issues.apache.org/jira/browse/HUDI-7879
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> In the initialization method of DataSourceInternalWriterHelper, it currently 
> creates two identical HoodieTable instances. We should remove one of them. 
> Also, when comparing the differences between the two HoodieTable instances, I 
> noticed that the createTable method in BaseHoodieWriteClient includes a 
> HadoopConfiguration parameter that isn't used by any implemented methods. I'm 
> not sure why it was designed this way, but I think we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7876) Use TypedProperties to store the spillable map configs for the FG reader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7876:
-
Labels: pull-request-available  (was: )

> Use TypedProperties to store the spillable map configs for the FG reader
> 
>
> Key: HUDI-7876
> URL: https://issues.apache.org/jira/browse/HUDI-7876
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> This takes up 4 params for the fg reader that can just be stored in the 
> TypedProperties that is already passed in.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7874) Fail to read 2-level structure Parquet

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7874:
-
Labels: pull-request-available  (was: )

> Fail to read 2-level structure Parquet
> --
>
> Key: HUDI-7874
> URL: https://issues.apache.org/jira/browse/HUDI-7874
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vitali Makarevich
>Priority: Major
>  Labels: pull-request-available
>
> If I have {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} 
> explicitly set - to be able to write nulls inside arrays (the only way) - 
> Hudi starts to write Parquet files with the following schema inside:
>  {{   required group internal_list (LIST) \{
> repeated group list {
>   required int64 element;
> }
>   }}}
>  
> But if I had some files produced before setting 
> {{{}"spark.hadoop.parquet.avro.write-old-list-structure", "false"{}}}, they 
> have the following schema inside:
>  {{  required group internal_list (LIST) \{
> repeated int64 array;
>   }}}
>  
> And Hudi 0.14.x, at least, fails to read records from such a file, failing 
> with the exception
> {{Caused by: java.lang.RuntimeException: Null-value for required field: }}
> even though the contents of the arrays are {{{}not null{}}} (they cannot be 
> null, in fact, since Avro requires 
> {{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write 
> {{{}null{}}}s).
> h3. Expected behavior
> Taken from Hudi 0.12.1 (not sure what exactly broke this):
>  # If I have a file with a 2-level structure and an update arrives (no matter 
> whether it has nulls inside the array or not - both produce the same result) 
> with "spark.hadoop.parquet.avro.write-old-list-structure", "false" - rewrite 
> it into 3-level ({*}fails in 0.14.1{*})
>  # If I have a 3-level structure with nulls and an update comes (with or 
> without nulls) - it is read and written correctly
> A simple reproduction of the issue can be found here:
> [https://github.com/VitoMakarevich/hudi-issue-014]
> Highly likely, the problem appeared after Hudi made some changes, so values 
> from the Hadoop conf started to propagate into the Reader instance (likely 
> they were not propagated before). The configuration switch in question is 
> sketched below.
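> A small illustrative sketch of the switch discussed above (parquet-avro's 
> write-old-list-structure flag; "spark.hadoop." is just Spark's prefix for 
> forwarding a property into the Hadoop Configuration):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> 
> Configuration conf = new Configuration();
> // false => write the 3-level list layout, which is the only way to allow
> // null elements inside arrays when writing through parquet-avro.
> conf.set("parquet.avro.write-old-list-structure", "false");
> {code}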



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7875) Remove tablePath from HoodieFileGroupReader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7875:
-
Labels: pull-request-available  (was: )

> Remove tablePath from HoodieFileGroupReader
> ---
>
> Key: HUDI-7875
> URL: https://issues.apache.org/jira/browse/HUDI-7875
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> tablePath is stored in the metaclient which is a param.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7873) Remove getStorage method from HoodieReaderContext

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7873:
-
Labels: pull-request-available  (was: )

> Remove getStorage method from HoodieReaderContext
> -
>
> Key: HUDI-7873
> URL: https://issues.apache.org/jira/browse/HUDI-7873
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> All implementations of the method were the same, and it was only used by a 
> test method because storage is passed as a param to the fg reader.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7872) Recreate Glue table on certain types of exceptions

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7872:
-
Labels: pull-request-available  (was: )

> Recreate Glue table on certain types of exceptions
> --
>
> Key: HUDI-7872
> URL: https://issues.apache.org/jira/browse/HUDI-7872
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vamsi Karnika
>Priority: Major
>  Labels: pull-request-available
>
> If there are certain types of exceptions (schema changes, unable to add 
> partitions) re-create the Glue table so that the table continues to be 
> queryable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7871) Remove tableconfig from HoodieFilegroupReader params

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7871:
-
Labels: pull-request-available  (was: )

> Remove tableconfig from HoodieFilegroupReader params
> 
>
> Key: HUDI-7871
> URL: https://issues.apache.org/jira/browse/HUDI-7871
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> In prod usages, we just get the tableconfigs from the metaclient. The 
> constructor has too many params so getting rid of one will be useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7867) Data deduplication caused by drawback in the delete invalid files before commit

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7867:
-
Labels: pull-request-available  (was: )

> Data deduplication caused by drawback in the delete invalid files before 
> commit
> ---
>
> Key: HUDI-7867
> URL: https://issues.apache.org/jira/browse/HUDI-7867
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Our user complained that after their daily job, which writes to a Hudi COW 
> table, finished, the downstream reading jobs found many duplicate records 
> today. The daily job has been online for a long time, and this is the first 
> time it produced such a wrong result.
> He gave a detailed duplicated record as an example to help debug. The record 
> appeared in 3 base files which belong to different file groups.
> [screenshot: the record appears in 3 base files]
> In today's writer job, the Spark application finished successfully.
> In the driver log, those two files are marked as invalid files to delete; 
> only one file is a valid file.
> [screenshot: driver log marking the two files as invalid]
> And in the clean-stage task log, those two files are also marked to be 
> deleted, and there is no exception in the task either.
> [screenshot: clean-stage task log]

[jira] [Updated] (HUDI-7838) Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and AbstractHoodieLogRecordReader

2024-06-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7838:
-
Labels: pull-request-available  (was: )

> Use Config hoodie.schema.cache.enable in HoodieBaseFileGroupRecordBuffer and  
> AbstractHoodieLogRecordReader
> ---
>
> Key: HUDI-7838
> URL: https://issues.apache.org/jira/browse/HUDI-7838
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Jonathan Vexler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> hoodie.schema.cache.enable should be used to decide if we want to use the 
> schema cache. Currently it is hardcoded to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7671) Make Hudi timeline backward compatible

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7671:
-
Labels: compatibility pull-request-available  (was: compatibility)

> Make Hudi timeline backward compatible
> --
>
> Key: HUDI-7671
> URL: https://issues.apache.org/jira/browse/HUDI-7671
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: compatibility, pull-request-available
> Fix For: 1.0.0
>
>
> Since release 1.x, the timeline metadata file name has changed to include 
> the completion time, so we need to keep compatibility with 0.x 
> branches/releases.
> 0.x meta file name pattern: ${instant_time}.action[.state]
> 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]
> In the 1.x release, while deciphering the Hudi instant from the metadata 
> files, if there is no completion time, the reader uses the file modification 
> time as the completion time instead.
> The modification time follows OCC (optimistic concurrency control) semantics 
> if the files were not moved around.
> Caution: if the table is a MOR table and the files were moved in the past 
> from an old folder to the current folder, the reader view may return a wrong 
> result set, because the completion times are identical for all the live 
> instants.
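> A minimal sketch of the fallback described above (hypothetical helper; not 
> Hudi's actual parser, and the returned millis-as-string is illustrative):
> 
>   import java.io.IOException;
>   import java.nio.file.Files;
>   import java.nio.file.Path;
> 
>   class InstantNameParser {
>     // 1.x: ${instant_time}_${completion_time}.action[.state]
>     // 0.x: ${instant_time}.action[.state] -> fall back to file mtime
>     static String completionTime(Path metaFile) throws IOException {
>       String name = metaFile.getFileName().toString();
>       int dot = name.indexOf('.');
>       String stem = dot >= 0 ? name.substring(0, dot) : name;
>       int sep = stem.indexOf('_');
>       if (sep >= 0) {
>         return stem.substring(sep + 1);  // 1.x name carries completion time
>       }
>       // 0.x name: use the file modification time as the completion time.
>       return String.valueOf(Files.getLastModifiedTime(metaFile).toMillis());
>     }
>   }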



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7869) Ensure properties are copied when modifying schema

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7869:
-
Labels: pull-request-available  (was: )

> Ensure properties are copied when modifying schema
> --
>
> Key: HUDI-7869
> URL: https://issues.apache.org/jira/browse/HUDI-7869
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Properties are not always copied when we modify the schema, for example 
> when removing fields.
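> A hedged sketch (Avro 1.9+ API; the helper name is illustrative) of carrying 
> properties over when dropping a field:
> 
>   import java.util.List;
>   import java.util.stream.Collectors;
>   import org.apache.avro.Schema;
>   import org.apache.avro.Schema.Field;
> 
>   class SchemaPropCopy {
>     static Schema dropField(Schema src, String fieldName) {
>       List<Field> fields = src.getFields().stream()
>           .filter(f -> !f.name().equals(fieldName))
>           .map(f -> {
>             Field nf = new Field(f.name(), f.schema(), f.doc(), f.defaultVal());
>             f.getObjectProps().forEach(nf::addProp);  // field-level props too
>             return nf;
>           })
>           .collect(Collectors.toList());
>       Schema out = Schema.createRecord(
>           src.getName(), src.getDoc(), src.getNamespace(), src.isError(), fields);
>       // The easy-to-miss step: carry over schema-level custom properties.
>       src.getObjectProps().forEach(out::addProp);
>       return out;
>     }
>   }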



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7779:
-
Labels: pull-request-available  (was: )

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline could lead to data consistency 
> issues on rare occasions. We should come up with proper guards to ensure we 
> do not perform such unintended archival. 
>  
> The major gap we want to guard against is: if someone disabled the cleaner, 
> archival should account for data consistency issues and bail out.
> We have a base guarding condition, where archival stops at the earliest 
> commit to retain, based on the latest clean commit metadata. But there are a 
> few other scenarios that need to be accounted for. 
>  
> a. Keeping aside replace commits, let's dive into specifics for regular 
> commits and delta commits.
> Say the user configured the cleaner to retain 4 commits and the archival 
> configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
> versions created at or before t6. Say the cleaner did not run (for whatever 
> reason) for the next 5 commits. 
>     Archival will certainly be guarded until the earliest commit to retain, 
> based on the latest clean commit. 
> Corner case to consider: 
> A savepoint was added to, say, t3 and later removed, and the cleaner was 
> never re-enabled. Archival would have stopped at t3 while the savepoint was 
> present, but once the savepoint is removed, if archival is executed, it 
> could archive commit t3. Which means the file versions tracked at t3 are 
> still not cleaned by the cleaner. 
> Reasoning: 
> We are good here w.r.t. data consistency. Until the cleaner runs next time, 
> these older file versions might be exposed to the end user. But time travel 
> query is not intended for already cleaned up commits, and hence this is not 
> an issue. None of snapshot, time travel, or incremental queries will run 
> into issues, as they are not supposed to poll for t3. 
> At any later point, if the cleaner is re-enabled, it will take care of 
> cleaning up file versions tracked at the t3 commit. Just that, for the 
> interim period, some older file versions might still be exposed to readers. 
>  
> b. The trickier part is when replace commits are involved. Since the replace 
> commit metadata in the active timeline is what ensures the replaced file 
> groups are ignored for reads, before archiving it, the cleaner is expected 
> to clean them up fully. But are there chances that this could go wrong? 
> Corner case to consider: let's add onto the above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> The cleaner will skip cleaning up files tracked by t3 (due to the presence 
> of the savepoint), but will clean up t4, t5 and t6. So the earliest commit 
> to retain will be pointing to t6. Now say the savepoint for t3 is removed, 
> but the cleaner is disabled. In this state of the timeline, if archival is 
> executed (since t3.savepoint is removed), archival might archive t3 and 
> t4.rc. This could lead to data duplicates, as both the replaced file groups 
> and the new file groups from t4.rc would be exposed as valid file groups. 
>  
> In other words, if we were to summarize the different scenarios: 
> i. The replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
> ii. The replaced file group is cleaned up. 
>     - ECTR is > this.rc, and it is good to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, full 
> cleanup did not happen. After the savepoint is removed and archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> don't account for as of now (see the sketch below).
>  
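> A hedged sketch of the guard for scenario iii; the view interface and method 
> names are illustrative, not Hudi's actual APIs:
> 
>   import java.util.List;
> 
>   class ArchivalGuard {
>     interface TimelineView {
>       boolean isReplaceCommit(String instant);
>       boolean replacedFileGroupsFullyCleaned(String instant);
>     }
> 
>     // Walk the archival candidates in order and stop at the first replace
>     // commit whose replaced file groups have not been fully cleaned yet.
>     static List<String> safeToArchive(List<String> candidates,
>                                       TimelineView view) {
>       int stop = 0;
>       for (String instant : candidates) {
>         if (view.isReplaceCommit(instant)
>             && !view.replacedFileGroupsFullyCleaned(instant)) {
>           break;  // gap iii: ECTR moved ahead, but cleanup did not happen
>         }
>         stop++;
>       }
>       return candidates.subList(0, stop);
>     }
>   }
> 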
> We have 3 options to solve this.
> Option A: 
> Let the savepoint deletion flow take care of cleaning up the files it is 
> tracking. 
> Cons:
> A savepoint's responsibility does not include removing data files, so from a 
> single-responsibility standpoint this may not be right. Also, this cleanup 
> might need to do what a clean planner would actually be doing, i.e., build 
> the file system view and understand if it is supposed to be cleaned up

[jira] [Updated] (HUDI-7847) Infer record merge mode during table upgrade

2024-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7847:
-
Labels: pull-request-available  (was: )

> Infer record merge mode during table upgrade
> 
>
> Key: HUDI-7847
> URL: https://issues.apache.org/jira/browse/HUDI-7847
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Record merge mode is required to dictate the merging behavior in release 1.x, 
> playing the same role as the payload class config in the 0.x releases. During 
> table upgrade, we need to infer the record merge mode from the payload class 
> so it is correctly set.
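> A hedged sketch of the inference; the enum mirrors Hudi's merge modes, but 
> the exact upgrade hook and the payload-to-mode mapping shown are illustrative:
> 
>   enum RecordMergeMode { COMMIT_TIME_ORDERING, EVENT_TIME_ORDERING, CUSTOM }
> 
>   class MergeModeInference {
>     static RecordMergeMode inferFromPayloadClass(String payloadClass) {
>       if (payloadClass == null
>           || payloadClass.endsWith("OverwriteWithLatestAvroPayload")) {
>         return RecordMergeMode.COMMIT_TIME_ORDERING;  // latest write wins
>       }
>       if (payloadClass.endsWith("DefaultHoodieRecordPayload")) {
>         return RecordMergeMode.EVENT_TIME_ORDERING;   // precombine field wins
>       }
>       return RecordMergeMode.CUSTOM;  // keep the user's payload semantics
>     }
>   }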



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7841:
-
Labels: pull-request-available  (was: )

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it obtains those candidate files 
> by iterating over all files from the file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use `prunedPartitionsAndFileSlices` to consider only the 
> pruned partitions whenever there is a partition predicate.
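> A hedged sketch of intersecting the RLI hits with the pruned partitions; the 
> map shape and names are illustrative, not the actual file-index types:
> 
>   import java.util.List;
>   import java.util.Map;
>   import java.util.Set;
>   import java.util.stream.Collectors;
> 
>   class RliCandidateFiles {
>     static Set<String> candidateFiles(
>         Map<String, List<String>> prunedPartitionToFiles, // partition -> files
>         Set<String> filesMatchingRecordKeys) {            // from RLI lookup
>       return prunedPartitionToFiles.values().stream()
>           .flatMap(List::stream)
>           .filter(filesMatchingRecordKeys::contains)  // intersect with RLI hits
>           .collect(Collectors.toSet());
>     }
>   }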



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7855) Add ability to dynamically configure write parallelism for BULK_INSERT for HoodieStreamer

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7855:
-
Labels: pull-request-available  (was: )

> Add ability to dynamically configure write parallelism for BULK_INSERT for 
> HoodieStreamer
> -
>
> Key: HUDI-7855
> URL: https://issues.apache.org/jira/browse/HUDI-7855
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
>
> Add the ability to dynamically configure write parallelism for BULK_INSERT 
> in HoodieStreamer. Currently, BULK_INSERT parallelism is configured based on 
> the source parallelism, which may be too aggressive or too conservative 
> depending on other factors, e.g. the number of partitions written to.
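> A hedged sketch of one way to derive the parallelism from estimated input 
> size instead; the heuristic and bounds are illustrative:
> 
>   class BulkInsertParallelism {
>     static int derive(long estimatedInputBytes, long targetBytesPerTask,
>                       int minTasks, int maxTasks) {
>       // One task per targetBytesPerTask of input, clamped to [min, max].
>       long tasks = Math.max(1,
>           estimatedInputBytes / Math.max(1, targetBytesPerTask));
>       return (int) Math.min(maxTasks, Math.max(minTasks, tasks));
>     }
>   }
> 
> For example, with 10 GB of input and a 120 MB target per task, this yields 
> roughly 85 tasks regardless of how the source was partitioned.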



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7854) Bump AWS SDK v2 version to 2.25.69

2024-06-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7854:
-
Labels: pull-request-available  (was: )

> Bump AWS SDK v2 version to 2.25.69
> --
>
> Key: HUDI-7854
> URL: https://issues.apache.org/jira/browse/HUDI-7854
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> The current version of AWS SDK v2 in use is 2.18.40, which is 1.5 years old.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7853) Fix missing serDe properties post migration from hiveSync to glueSync

2024-06-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7853:
-
Labels: pull-request-available  (was: )

> Fix missing serDe properties post migration from hiveSync to glueSync
> -
>
> Key: HUDI-7853
> URL: https://issues.apache.org/jira/browse/HUDI-7853
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prathit Malik
>Assignee: Prathit Malik
>Priority: Major
>  Labels: pull-request-available
>
> More info : [https://github.com/apache/hudi/issues/11397]
>  
> After migration to 0.13.1, the Hudi table path is missing from the serde 
> properties, because of which the below error is thrown when reading from 
> Spark:
> - org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
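> A hedged workaround sketch until the serde property is restored: pass the 
> base path explicitly when reading, so the relation does not depend on the 
> serde 'path' (`spark` and `basePath` are assumed to be supplied by the 
> caller):
> 
>   import org.apache.spark.sql.Dataset;
>   import org.apache.spark.sql.Row;
>   import org.apache.spark.sql.SparkSession;
> 
>   class ReadHudiByPath {
>     static Dataset<Row> read(SparkSession spark, String basePath) {
>       // Supplying the path directly sidesteps the missing serde property.
>       return spark.read().format("hudi").load(basePath);
>     }
>   }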



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

