[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-18 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7877:
--
Status: Patch Available  (was: In Progress)

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in index lookups.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-18 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7841:
--
Status: Patch Available  (was: In Progress)

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from the file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use `prunedPartitionsAndFileSlices` to consider only 
> pruned partitions whenever there is a partition predicate.
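As an illustration only (a plain Java map stands in for Hudi's file index; the class, method, and variable names below are hypothetical, not Hudi's API), restricting the candidate-file lookup to pruned partitions looks like this:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class PrunedPartitionLookup {
    // Sketch only: partitionToFiles stands in for the file index.
    public static List<String> candidateFiles(Map<String, List<String>> partitionToFiles,
                                              Set<String> prunedPartitions) {
        // Instead of iterating every file in the index, restrict the scan to
        // files in partitions that survived partition pruning.
        return partitionToFiles.entrySet().stream()
                .filter(e -> prunedPartitions.contains(e.getKey()))
                .flatMap(e -> e.getValue().stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        index.put("2024/06/17", Arrays.asList("f1.parquet", "f2.parquet"));
        index.put("2024/06/18", Collections.singletonList("f3.parquet"));
        // A partition predicate pruned everything except 2024/06/18.
        System.out.println(candidateFiles(index, Collections.singleton("2024/06/18"))); // prints [f3.parquet]
    }
}
```

When no partition predicate exists, the pruned set would simply be all partitions, degenerating to the current behavior.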





[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-18 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7877:
--
Status: In Progress  (was: Open)

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in index lookups.





[jira] [Updated] (HUDI-7905) New Action for Clustering

2024-06-18 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Status: In Progress  (was: Open)

> New Action for Clustering
> -
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action. This simplifies a few 
> things: for example, we no longer need to scan replacecommit.requested to 
> determine whether we are looking at a clustering plan. It also standardizes 
> the usage of replacecommit to some extent (related to HUDI-1739).
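Purely as an illustration (the enum and method below are hypothetical, not Hudi's timeline API), a dedicated action lets readers branch on the action type alone instead of opening and inspecting a replacecommit.requested plan:

```java
public class ActionDispatch {
    // Hypothetical action types for illustration.
    enum Action { CLUSTERING, REPLACE_COMMIT, COMMIT }

    // With a dedicated clustering action, no plan file has to be read to tell
    // clustering apart from insert overwrite or delete partition.
    static boolean isClustering(Action action) {
        return action == Action.CLUSTERING;
    }

    public static void main(String[] args) {
        System.out.println(isClustering(Action.CLUSTERING));     // prints true
        System.out.println(isClustering(Action.REPLACE_COMMIT)); // prints false
    }
}
```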





[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-18 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7841:
--
Status: In Progress  (was: Open)

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from the file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use `prunedPartitionsAndFileSlices` to consider only 
> pruned partitions whenever there is a partition predicate.





[jira] [Commented] (HUDI-7907) Validate new file slicing on table with mix of older and new log files

2024-06-18 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856090#comment-17856090
 ] 

Danny Chen commented on HUDI-7907:
--

The file slicing is based on log file completion time, but the log file naming 
convention has changed. As we discussed, we should do a full compaction before 
the upgrade, right? It is not a wise choice to keep compatibility for log file 
name resolution, because that is a hot-spot code path.

> Validate new file slicing on table with mix of older and new log files
> --
>
> Key: HUDI-7907
> URL: https://issues.apache.org/jira/browse/HUDI-7907
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Log file naming has changed, i.e., we now have the deltacommit time instead 
> of the base commit time in the log file name. Could there be an edge case 
> where file slicing is incorrect if we have a mix of older and new log files 
> within the same file group? `HoodieLogFile#getDeltaCommitTime` will point to 
> the base commit time for older log files, while for newer ones it will point 
> to deltacommit times. Writes are still serialized because new deltacommit 
> times must be > the base commit time, but we need to test the scenario fully.
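A self-contained sketch of the concern (the file-name format and parsing below are hypothetical stand-ins, not Hudi's actual `HoodieLogFile` logic): as long as every new deltacommit time is strictly greater than the base commit time embedded in older names, sorting by the embedded time token still yields a consistent order even within a mixed file group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LogFileOrdering {
    // Hypothetical name format: .<fileId>_<timeToken>.log.<version>
    // Older files embed the base commit time; newer ones embed the deltacommit time.
    static long timeToken(String logFileName) {
        return Long.parseLong(logFileName.split("_|\\.log")[1]);
    }

    public static void main(String[] args) {
        List<String> logs = new ArrayList<>(Arrays.asList(
                ".f1_20240618090000.log.1",    // new naming: deltacommit time
                ".f1_20240610120000.log.1",    // old naming: base commit time
                ".f1_20240618100000.log.2"));  // new naming: later deltacommit
        logs.sort(Comparator.comparingLong(LogFileOrdering::timeToken));
        // The old-named file sorts first because its base commit time precedes
        // every new deltacommit time; the ordering invariant keeps mixed naming consistent.
        System.out.println(logs.get(0)); // prints .f1_20240610120000.log.1
    }
}
```

If that invariant were ever violated, sorting by the token alone would interleave old and new files incorrectly, which is exactly the edge case the ticket asks to validate.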





[jira] [Created] (HUDI-7908) HoodieFileGroupReader fails if precombine and partition fields are same

2024-06-18 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7908:
-

 Summary: HoodieFileGroupReader fails if precombine and partition 
fields are same
 Key: HUDI-7908
 URL: https://issues.apache.org/jira/browse/HUDI-7908
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Sagar Sumit
 Fix For: 1.0.0


{code:java}
test(s"Test INSERT INTO with upsert operation type") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      Seq("mor").foreach { tableType =>
        val tableName = generateTableName
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  ts long,
             |  price int
             |) using hudi
             |partitioned by (ts)
             |tblproperties (
             |  type = '$tableType',
             |  primaryKey = 'id',
             |  preCombineField = 'ts'
             |)
             |location '${tmp.getCanonicalPath}/$tableName'
             |""".stripMargin
        )

        // Test insert into with upsert operation type
        spark.sql(
          s"""
             | insert into $tableName
             | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30), (4, 'a4', 2000, 10), (5, 'a5', 3000, 20), (6, 'a6', 4000, 30)
             | """.stripMargin
        )
        checkAnswer(s"select id, name, price, ts from $tableName where price>3000")(
          Seq(6, "a6", 4000, 30)
        )

        // Test update
        spark.sql(s"update $tableName set price = price + 1 where id = 6")
        checkAnswer(s"select id, name, price, ts from $tableName where price>3000")(
          Seq(6, "a6", 4001, 30)
        )
      }
    }
  }
} {code}





[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6 Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

[jira] [Updated] (HUDI-7420) Parallelize the process of constructing `logFilesMarkerPath` in CommitMetadatautils#reconcileMetadataForMissingFiles

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7420:
--
Sprint: Sprint 2024-03-25, Sprint 2024-04-26, 2024/06/03-16  (was: Sprint 
2024-03-25, Sprint 2024-04-26, 2024/06/17-30, 2024/06/03-16)

> Parallelize the process of constructing `logFilesMarkerPath` in 
> CommitMetadatautils#reconcileMetadataForMissingFiles
> 
>
> Key: HUDI-7420
> URL: https://issues.apache.org/jira/browse/HUDI-7420
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> This is related to HUDI-1517.
> Current logic is:
> {code:java}
> Set<String> logFilesMarkerPath = new HashSet<>();
> allLogFilesMarkerPath.stream().filter(logFilePath -> !logFilePath.endsWith("cdc")).forEach(logFilesMarkerPath::add);
> // remove valid log files
> // TODO: refactor based on HoodieData
> for (Map.Entry<String, List<HoodieWriteStat>> partitionAndWriteStats : commitMetadata.getPartitionToWriteStats().entrySet()) {
>   for (HoodieWriteStat hoodieWriteStat : partitionAndWriteStats.getValue()) {
>     logFilesMarkerPath.remove(hoodieWriteStat.getPath());
>   }
> } {code}
> The for loop can be achieved via context.parallelize as below, but need to 
> check for thread-safety.
> {code:java}
> Set<String> logFilesMarkerPath = new HashSet<>();
> allLogFilesMarkerPath.stream().filter(logFilePath -> !logFilePath.endsWith("cdc")).forEach(logFilesMarkerPath::add);
> // Convert the partition and write stats to a list of log file paths to be removed
> List<String> validLogFilePaths = context.parallelize(new ArrayList<>(commitMetadata.getPartitionToWriteStats().entrySet()))
>     .flatMapToPair((SerializablePairFunction<Map.Entry<String, List<HoodieWriteStat>>, String, Void>) entry -> {
>       List<Pair<String, Void>> pathsToRemove = new ArrayList<>();
>       entry.getValue().forEach(hoodieWriteStat -> pathsToRemove.add(Pair.of(hoodieWriteStat.getPath(), null)));
>       return pathsToRemove.iterator();
>     })
>     .map(t -> t.getLeft())
>     .collect();
> // Remove the valid log file paths from logFilesMarkerPath in a parallel manner
> // Depending on the specifics of your environment and HoodieEngineContext, this might need to be adapted.
> // For a straightforward approach without parallelization of the remove operation:
> validLogFilePaths.forEach(logFilesMarkerPath::remove); {code}
>  
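The thread-safety question raised above can be sidestepped with a concurrent set. This is a self-contained sketch using plain Java collections (the names are hypothetical, and `parallelStream` stands in for `context.parallelize`; it is not the Hudi implementation):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MarkerReconcile {
    // Sketch of the reconciliation step: plain collections stand in for
    // Hudi's marker paths and write stats.
    public static Set<String> danglingMarkers(Set<String> allLogFileMarkers,
                                              Map<String, List<String>> partitionToWrittenPaths) {
        // A concurrent set makes parallel removal safe; a plain HashSet would not be.
        Set<String> markers = ConcurrentHashMap.newKeySet();
        allLogFileMarkers.stream().filter(p -> !p.endsWith("cdc")).forEach(markers::add);
        // Remove every path accounted for by a write stat, in parallel.
        partitionToWrittenPaths.values().parallelStream()
                .flatMap(List::stream)
                .forEach(markers::remove);
        return markers;
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("p/l1.log", "p/l2.log", "p/l3.cdc"));
        Map<String, List<String>> written = Collections.singletonMap("p", Arrays.asList("p/l1.log"));
        System.out.println(danglingMarkers(all, written)); // prints [p/l2.log]
    }
}
```

Collecting the valid paths first and then removing them serially, as the ticket's second snippet does, is the other safe option; the concurrent set simply avoids the intermediate list.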





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


VitoMakarevich commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2177158089

   @yihua you can merge if you plan to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


VitoMakarevich closed pull request #11465: [HUDI-7874] Avro fix read 2 level 
and 3 level files
URL: https://github.com/apache/hudi/pull/11465





[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6 Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Commit instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6 Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ

[jira] [Updated] (HUDI-7904) RLI not skipping data files

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7904:
--
Sprint: 2024/06/17-30

> RLI not skipping data files
> ---
>
> Key: HUDI-7904
> URL: https://issues.apache.org/jira/browse/HUDI-7904
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0-beta2, 1.0.0
>
> Attachments: image (9).png
>
>
> Enable RLI on the record key field and run a query with an equality predicate 
> on the record key:
> SELECT id, rider, driver FROM hudi_table WHERE id = 'trip1';
> The Spark UI still shows 4 files scanned; however, per the predicate only 1 
> file qualifies.
> !image (9).png!
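The expected behavior can be sketched with an in-memory stand-in for the record-level index (the class and names below are illustrative only, not Hudi's implementation):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class RecordIndexSkipping {
    // Sketch only: a map from record key to containing file stands in for RLI.
    public static Set<String> filesToScan(Map<String, String> recordIndex, String key) {
        String file = recordIndex.get(key);
        // With an equality predicate on the record key, only the mapped file qualifies.
        return file == null ? Collections.emptySet() : Collections.singleton(file);
    }

    public static void main(String[] args) {
        Map<String, String> rli = new HashMap<>();
        rli.put("trip1", "file-1.parquet");
        rli.put("trip2", "file-2.parquet");
        // Expected for the query above: 1 file scanned, not all 4.
        System.out.println(filesToScan(rli, "trip1")); // prints [file-1.parquet]
    }
}
```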





[jira] [Updated] (HUDI-7905) New Action for Clustering

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7905:
--
Sprint: 2024/06/17-30

> New Action for Clustering
> -
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action. This simplifies a few 
> things: for example, we no longer need to scan replacecommit.requested to 
> determine whether we are looking at a clustering plan. It also standardizes 
> the usage of replacecommit to some extent (related to HUDI-1739).





[jira] [Updated] (HUDI-7907) Validate new file slicing on table with mix of older and new log files

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7907:
--
Sprint: 2024/06/17-30

> Validate new file slicing on table with mix of older and new log files
> --
>
> Key: HUDI-7907
> URL: https://issues.apache.org/jira/browse/HUDI-7907
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Log file naming has changed, i.e., we now have the deltacommit time instead 
> of the base commit time in the log file name. Could there be an edge case 
> where file slicing is incorrect if we have a mix of older and new log files 
> within the same file group? `HoodieLogFile#getDeltaCommitTime` will point to 
> the base commit time for older log files, while for newer ones it will point 
> to deltacommit times. Writes are still serialized because new deltacommit 
> times must be > the base commit time, but we need to test the scenario fully.





[jira] [Updated] (HUDI-7903) Partition Stats Index not getting created with SQL

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7903:
--
Sprint: 2024/06/17-30

> Partition Stats Index not getting created with SQL
> --
>
> Key: HUDI-7903
> URL: https://issues.apache.org/jira/browse/HUDI-7903
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> {code:java}
> spark.sql(
>   s"""
>  | create table $tableName using hudi
>  | partitioned by (dt)
>  | tblproperties(
>  |primaryKey = 'id',
>  |preCombineField = 'ts',
>  |'hoodie.metadata.index.partition.stats.enable' = 'true'
>  | )
>  | location '$tablePath'
>  | AS
>  | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, cast('2021-05-06' as date) as dt
>""".stripMargin
> ) {code}
> Even when partition stats is enabled, the index is not created via SQL. It 
> works via the datasource write path.





[jira] [Updated] (HUDI-7907) Validate new file slicing on table with mix of older and new log files

2024-06-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7907:
--
Fix Version/s: 1.0.0

> Validate new file slicing on table with mix of older and new log files
> --
>
> Key: HUDI-7907
> URL: https://issues.apache.org/jira/browse/HUDI-7907
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Log file naming has changed, i.e., we now have the deltacommit time instead 
> of the base commit time in the log file name. Could there be an edge case 
> where file slicing is incorrect if we have a mix of older and new log files 
> within the same file group? `HoodieLogFile#getDeltaCommitTime` will point to 
> the base commit time for older log files, while for newer ones it will point 
> to deltacommit times. Writes are still serialized because new deltacommit 
> times must be > the base commit time, but we need to test the scenario fully.





[jira] [Created] (HUDI-7907) Validate new file slicing on table with mix of older and new log files

2024-06-18 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7907:
-

 Summary: Validate new file slicing on table with mix of older and 
new log files
 Key: HUDI-7907
 URL: https://issues.apache.org/jira/browse/HUDI-7907
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit
Assignee: Danny Chen


Log file naming has changed, i.e., we now have the deltacommit time instead of 
the base commit time in the log file name. Could there be an edge case where 
file slicing is incorrect if we have a mix of older and new log files within 
the same file group? `HoodieLogFile#getDeltaCommitTime` will point to the base 
commit time for older log files, while for newer ones it will point to 
deltacommit times. Writes are still serialized because new deltacommit times 
must be > the base commit time, but we need to test the scenario fully.





Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


codope commented on code in PR #11467:
URL: https://github.com/apache/hudi/pull/11467#discussion_r1645029522


##
hudi-common/src/main/avro/HoodieMetadata.avsc:
##
@@ -427,6 +427,15 @@
 "type": "int",
 "default": 0,
 "doc": "Represents fileId encoding. Possible values are 0 and 1. 0 represents UUID based fileID, and 1 represents raw string format of the fileId. \nWhen the encoding is 0, reader can deduce fileID from fileIdLowBits, fileIdHighBits and fileIndex."
+},
+{
+"name": "position",
+"type": [
+"null",
+"long"
+],
+"default": null,

Review Comment:
   Should we instead have a default of `-1L`? cc @yihua @nsivabalan 
   That's what we use in `HoodieRecordLocation` - 
https://github.com/apache/hudi/blob/9f0130442a502bff6d6f7a649a7808a03d51da41/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordLocation.java#L46
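The trade-off under discussion (a nullable Avro field vs a `-1L` sentinel like `HoodieRecordLocation`'s) can be sketched independently of the schema; the class and method names below are illustrative only:

```java
public class RecordPosition {
    // Hypothetical sentinel mirroring the -1L default discussed above.
    static final long INVALID_POSITION = -1L;

    static boolean isValid(long position) {
        return position != INVALID_POSITION;
    }

    // Bridging a nullable value (as an Avro ["null","long"] union would yield)
    // to the sentinel representation.
    static long fromNullable(Long position) {
        return position == null ? INVALID_POSITION : position;
    }

    public static void main(String[] args) {
        System.out.println(isValid(fromNullable(null))); // prints false
        System.out.println(isValid(fromNullable(42L)));  // prints true
    }
}
```

A sentinel keeps the field a plain `long` for readers, at the cost of reserving one value; a null default makes "absent" explicit but forces union handling at every call site.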



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -156,6 +156,7 @@ public class HoodieMetadataPayload implements 
HoodieRecordPayload 
createRecordIndexUpdate(String
   fileIndex,
   "",
   instantTimeMillis,
-  0));
+  0,
+  null));

Review Comment:
   note: might change if we decide to use -1 as default






Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176896279

   
   ## CI report:
   
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176806759

   
   ## CI report:
   
   * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176794046

   
   ## CI report:
   
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176793952

   
   ## CI report:
   
   * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
 
   
   



Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176780305

   
   ## CI report:
   
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 UNKNOWN
   
   



Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176780178

   
   ## CI report:
   
   * 16afb3d821f1fd35beff26f697016826bcf55491 UNKNOWN
   
   



(hudi) branch master updated: [HUDI-7874] Fix Hudi being able to read 2-level structure (#11450)

2024-06-18 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9f0130442a5 [HUDI-7874] Fix Hudi being able to read 2-level structure 
(#11450)
9f0130442a5 is described below

commit 9f0130442a502bff6d6f7a649a7808a03d51da41
Author: Vitali Makarevich 
AuthorDate: Tue Jun 18 20:32:35 2024 +0200

[HUDI-7874] Fix Hudi being able to read 2-level structure (#11450)

Co-authored-by: vmakarevich 
---
 .../hudi/io/hadoop/HoodieAvroParquetReader.java|   2 +-
 .../apache/parquet/avro/HoodieAvroReadSupport.java |  23 +-
 .../hudi/TestParquetReaderCompatibility.scala  | 325 +
 3 files changed, 344 insertions(+), 6 deletions(-)

diff --git 
a/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
 
b/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
index dfbf4801687..bf1e4218364 100644
--- 
a/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
+++ 
b/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
@@ -166,7 +166,7 @@ public class HoodieAvroParquetReader extends 
HoodieAvroFileReader {
 // NOTE: We have to set both Avro read-schema and projection schema to make
 //   sure that in case the file-schema is not equal to read-schema 
we'd still
 //   be able to read that file (in case projection is a proper one)
-Configuration hadoopConf = storage.getConf().unwrapAs(Configuration.class);
+Configuration hadoopConf = 
storage.getConf().unwrapCopyAs(Configuration.class);
 if (!requestedSchema.isPresent()) {
   AvroReadSupport.setAvroReadSchema(hadoopConf, schema);
   AvroReadSupport.setRequestedProjection(hadoopConf, schema);
diff --git 
a/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
 
b/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
index 326accb66b2..07015209435 100644
--- 
a/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
+++ 
b/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
@@ -46,11 +46,7 @@ public class HoodieAvroReadSupport extends 
AvroReadSupport {
   @Override
   public ReadContext init(Configuration configuration, Map 
keyValueMetaData, MessageType fileSchema) {
 boolean legacyMode = checkLegacyMode(fileSchema.getFields());
-// support non-legacy list
-if (!legacyMode && 
configuration.get(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE) == null) {
-  configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
-  "false", "support reading avro from non-legacy map/list in parquet 
file");
-}
+adjustConfToReadWithFileProduceMode(legacyMode, configuration);
 ReadContext readContext = super.init(configuration, keyValueMetaData, 
fileSchema);
 MessageType requestedSchema = readContext.getRequestedSchema();
 // support non-legacy map. Convert non-legacy map to legacy map
@@ -62,6 +58,23 @@ public class HoodieAvroReadSupport extends 
AvroReadSupport {
 return new ReadContext(requestedSchema, 
readContext.getReadSupportMetadata());
   }
 
+  /**
+   * Here we want set config with which file has been written.
+   * Even though user may have overwritten {@link 
AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE},
+   * it's only applicable to how to produce new files(here is a read path).
+   * Later the config value {@link AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE} 
will still be used
+   * to write new file according to the user preferences.
+   **/
+  private void adjustConfToReadWithFileProduceMode(boolean 
isLegacyModeWrittenFile, Configuration configuration) {
+if (isLegacyModeWrittenFile) {
+  configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
+  "true", "support reading avro from legacy map/list in parquet file");
+} else {
+  configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
+  "false", "support reading avro from non-legacy map/list in parquet 
file");
+}
+  }
+
   /**
* Check whether write map/list with legacy mode.
* legacy:
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala
new file mode 100644
index 000..c5f91657f12
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala
@@ -0,0 +1,325 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses th

Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


yihua merged PR #11450:
URL: https://github.com/apache/hudi/pull/11450





Re: [PR] add show create table command [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11471:
URL: https://github.com/apache/hudi/pull/11471#issuecomment-2176612950

   
   ## CI report:
   
   * c472e2ad91d62204b38aa15d92fe60ca528b6275 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24455)
 
   
   



Re: [PR] add show create table command [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11471:
URL: https://github.com/apache/hudi/pull/11471#issuecomment-2176599395

   
   ## CI report:
   
   * c472e2ad91d62204b38aa15d92fe60ca528b6275 UNKNOWN
   
   



Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11470:
URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176584962

   
   ## CI report:
   
   * c4bf9390e01ce6cca6627d5d8c592413121386c2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24453)
 
   
   



[PR] add show create table command [hudi]

2024-06-18 Thread via GitHub


houyuting opened a new pull request, #11471:
URL: https://github.com/apache/hudi/pull/11471

   ### Change Logs
   
   Add support for the `show create table` command.
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





(hudi) branch master updated: [HUDI-7876] use properties to store log spill map configs for fg reader (#11455)

2024-06-18 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new bf1df335442 [HUDI-7876] use properties to store log spill map configs 
for fg reader (#11455)
bf1df335442 is described below

commit bf1df335442d38932cf7f8c6ef4228c316278569
Author: Jon Vexler 
AuthorDate: Tue Jun 18 12:30:56 2024 -0400

[HUDI-7876] use properties to store log spill map configs for fg reader 
(#11455)

* use properties to store log spill map configs for fg reader

* use constant for the max buffer size

* rename payloadProps to props

-

Co-authored-by: Jonathan Vexler <=>
---
 .../read/HoodieBaseFileGroupRecordBuffer.java  | 46 +-
 .../common/table/read/HoodieFileGroupReader.java   | 13 ++
 .../read/HoodieKeyBasedFileGroupRecordBuffer.java  | 10 +
 .../HoodiePositionBasedFileGroupRecordBuffer.java  | 10 +
 .../read/HoodieUnmergedFileGroupRecordBuffer.java  | 10 +
 .../table/read/TestHoodieFileGroupReaderBase.java  |  9 +++--
 .../reader/HoodieFileGroupReaderTestUtils.java | 12 +++---
 .../HoodieFileGroupReaderBasedRecordReader.java| 24 +--
 ...odieFileGroupReaderBasedParquetFileFormat.scala | 15 +++
 ...stHoodiePositionBasedFileGroupRecordBuffer.java | 11 +++---
 10 files changed, 68 insertions(+), 92 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
index 88ec42673ac..aea50e44fbe 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
@@ -29,6 +29,7 @@ import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.log.KeySpec;
 import org.apache.hudi.common.table.log.block.HoodieDataBlock;
 import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
 import org.apache.hudi.common.util.InternalSchemaCache;
 import org.apache.hudi.common.util.Option;
@@ -50,9 +51,14 @@ import java.io.IOException;
 import java.io.Serializable;
 import java.util.Collections;
 import java.util.Iterator;
+import java.util.Locale;
 import java.util.Map;
 import java.util.function.Function;
 
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static 
org.apache.hudi.common.config.HoodieMemoryConfig.MAX_MEMORY_FOR_MERGE;
+import static 
org.apache.hudi.common.config.HoodieMemoryConfig.SPILLABLE_MAP_BASE_PATH;
 import static 
org.apache.hudi.common.engine.HoodieReaderContext.INTERNAL_META_SCHEMA;
 import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.INSTANT_TIME;
 import static 
org.apache.hudi.common.table.read.HoodieFileGroupReader.getRecordMergeMode;
@@ -64,7 +70,7 @@ public abstract class HoodieBaseFileGroupRecordBuffer 
implements HoodieFileGr
   protected final Option partitionPathFieldOpt;
   protected final RecordMergeMode recordMergeMode;
   protected final HoodieRecordMerger recordMerger;
-  protected final TypedProperties payloadProps;
+  protected final TypedProperties props;
   protected final ExternalSpillableMap, 
Map>> records;
   protected ClosableIterator baseFileIterator;
   protected Iterator, Map>> logRecordIterator;
@@ -78,24 +84,26 @@ public abstract class HoodieBaseFileGroupRecordBuffer 
implements HoodieFileGr
  Option 
partitionNameOverrideOpt,
  Option 
partitionPathFieldOpt,
  HoodieRecordMerger recordMerger,
- TypedProperties payloadProps,
- long maxMemorySizeInBytes,
- String spillableMapBasePath,
- ExternalSpillableMap.DiskMapType 
diskMapType,
- boolean 
isBitCaskDiskMapCompressionEnabled) {
+ TypedProperties props) {
 this.readerContext = readerContext;
 this.readerSchema = readerContext.getSchemaHandler().getRequiredSchema();
 this.partitionNameOverrideOpt = partitionNameOverrideOpt;
 this.partitionPathFieldOpt = partitionPathFieldOpt;
-this.recordMergeMode = getRecordMergeMode(payloadProps);
+this.recordMergeMode = getRecordMergeMode(props);
 this.recordMerger = recordMerger;
 //Custom merge 

Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]

2024-06-18 Thread via GitHub


jonvex merged PR #11455:
URL: https://github.com/apache/hudi/pull/11455





Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11470:
URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176505908

   
   ## CI report:
   
   * c4bf9390e01ce6cca6627d5d8c592413121386c2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24453)
 
   
   



Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11470:
URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176490891

   
   ## CI report:
   
   * c4bf9390e01ce6cca6627d5d8c592413121386c2 UNKNOWN
   
   



Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176490732

   
   ## CI report:
   
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
 
   
   



Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]

2024-06-18 Thread via GitHub


soumilshah1995 commented on issue #11469:
URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176470646

   The error indicates that Spark cannot find the Hudi data source 
(org.apache.hudi.DefaultSource), which typically means the required Hudi jar is 
not properly included or recognized by Spark. Please ensure you are using the 
right jar files for your version of Spark.
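
   A common fix is to pass the matching Hudi Spark bundle explicitly at submit 
time. A sketch of what that can look like (the bundle coordinates and config 
values below are illustrative assumptions — substitute the artifact matching 
your Spark, Scala, and Hudi versions):

```shell
# Illustrative only: replace the bundle coordinates with the ones matching
# your Spark, Scala, and Hudi versions.
spark-submit \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  your_job.py
```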





[jira] [Updated] (HUDI-7906) improve the parallelism deduce in rdd write

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7906:
-
Labels: pull-request-available  (was: )

> improve the parallelism deduce in rdd write
> ---
>
> Key: HUDI-7906
> URL: https://issues.apache.org/jira/browse/HUDI-7906
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> as [https://github.com/apache/hudi/issues/11274] and 
> [https://github.com/apache/hudi/pull/11463] describe, there are two problem 
> cases:
>  # if the input RDD has not been shuffled, the partition count can be far 
> too large or too small
>  # the user cannot easily control it
>  ## in some cases the user can set `spark.default.parallelism` to change it
>  ## in some cases the user cannot change it because it is hard-coded
>  ## in Spark, the preferred approach is to let `spark.default.parallelism` or 
> `spark.sql.shuffle.partitions` control it, with Hudi-specific settings as an 
> advanced override



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]

2024-06-18 Thread via GitHub


KnightChess opened a new pull request, #11470:
URL: https://github.com/apache/hudi/pull/11470

   ### Change Logs
   
   as https://github.com/apache/hudi/issues/11274 and 
https://github.com/apache/hudi/pull/11463 describe, there are two problem cases:
   
   - if the input RDD has not been shuffled, the partition count can be far too 
large or too small
   - the user cannot easily control it
     - in some cases the user can set `spark.default.parallelism` to change it
     - in some cases the user cannot change it because it is hard-coded
     - in Spark, the preferred approach is to let `spark.default.parallelism` 
or `spark.sql.shuffle.partitions` control it, with Hudi-specific settings as an 
advanced override
   
   ### Impact
   
   As with dedup, which already uses the new deduction logic, users can control 
the parallelism via `spark.sql.shuffle.partitions` or 
`spark.default.parallelism`.
   For special scenarios, the advanced parameters can still be used.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7906) improve the parallelism deduce in rdd write

2024-06-18 Thread KnightChess (Jira)
KnightChess created HUDI-7906:
-

 Summary: improve the parallelism deduce in rdd write
 Key: HUDI-7906
 URL: https://issues.apache.org/jira/browse/HUDI-7906
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: KnightChess
Assignee: KnightChess


as [https://github.com/apache/hudi/issues/11274] and 
[https://github.com/apache/hudi/pull/11463] describe, there are two problem 
cases:
 # if the input RDD has not been shuffled, the partition count can be far too 
large or too small
 # the user cannot easily control it
 ## in some cases the user can set `spark.default.parallelism` to change it
 ## in some cases the user cannot change it because it is hard-coded
 ## in Spark, the preferred approach is to let `spark.default.parallelism` or 
`spark.sql.shuffle.partitions` control it, with Hudi-specific settings as an 
advanced override
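
The fallback order described above can be sketched roughly as follows (a 
hypothetical illustration — the function and argument names are not Hudi's 
actual API):

```python
# Hypothetical sketch of the fallback order described in the ticket; the
# function and argument names are illustrative, not Hudi's implementation.
def deduce_parallelism(hudi_setting, shuffle_partitions, default_parallelism,
                       input_partitions):
    """Prefer an explicit Hudi-level setting, then Spark's generic knobs
    (spark.sql.shuffle.partitions, spark.default.parallelism), then fall
    back to the input RDD's own partition count."""
    for candidate in (hudi_setting, shuffle_partitions, default_parallelism):
        if candidate is not None and candidate > 0:
            return candidate
    return input_partitions
```

With this shape, `spark.sql.shuffle.partitions` or `spark.default.parallelism` 
takes effect whenever no Hudi-specific parallelism is set, which is the 
behavior the ticket asks for.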



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11455:
URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176358198

   
   ## CI report:
   
   * 17dfb2b314e57e251901611d90d35231c701f167 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24436)
 
   
   



Re: [I] [SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error [hudi]

2024-06-18 Thread via GitHub


ankit0811 commented on issue #11466:
URL: https://github.com/apache/hudi/issues/11466#issuecomment-2176336217

   Hmm. I didn't find any relevant errors in the TM logs. 
   
   Changed IGNORE_KEY to true and it seems to be working, but I don't see any 
data in the parquet files. They are all empty. Any idea how I should debug this 
further?





Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11467:
URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176339072

   
   ## CI report:
   
   * a7f2ee34e98381ad9afa7c6dfa634aace8b3546b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24450)
 
   
   



Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176338861

   
   ## CI report:
   
   * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
 
   
   



Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11455:
URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176338743

   
   ## CI report:
   
   * 17dfb2b314e57e251901611d90d35231c701f167 UNKNOWN
   
   



Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]

2024-06-18 Thread via GitHub


jonvex commented on PR #11455:
URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176309813

   
![image](https://github.com/apache/hudi/assets/26940621/bd4f9a9f-6fce-4c5f-872c-ce08db9654a6)
   CI passing





Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]

2024-06-18 Thread via GitHub


ROOBALJINDAL commented on issue #11469:
URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176309379

   @soumilshah1995 thanks for replying. I know how to use the streamer on EMR 
Serverless and don't need a tutorial. Can you please help me with this 
particular exception?
   





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


VitoMakarevich commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176266302

   > Also I see that this block
   > 
   > ```
   > if (!legacyMode) {
   >   requestedSchema = new MessageType(requestedSchema.getName(), 
convertLegacyMap(requestedSchema.getFields()));
   > }
   > ```
   > 
   > is redundant - since in all the cases `requestedSchema` fetched after 
tuning AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE equals to schema coming from 
this block - meaning AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE does the correct 
conversion. So this code block should be removed to not cause confusion
   
   Sorry - this is incorrect; it looks like Spark 3.1 fails without this block.
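
For context, the legacy-mode check discussed here boils down to distinguishing 
Parquet's legacy 2-level list encoding from the standard 3-level one. A minimal 
sketch of that distinction, with schemas modeled as plain dicts rather than 
real Parquet types (not the actual Hudi code):

```python
# Simplified sketch: a standard (3-level) Parquet LIST wraps its elements in a
# repeated group named "list" containing a single "element" field; the legacy
# (2-level) encoding uses a bare repeated field (commonly named "array").
def is_legacy_list(list_group):
    repeated = list_group["repeated"]
    return not (
        repeated["name"] == "list"
        and len(repeated.get("fields", [])) == 1
        and repeated["fields"][0]["name"] == "element"
    )

three_level = {"repeated": {"name": "list", "fields": [{"name": "element"}]}}
two_level = {"repeated": {"name": "array"}}

print(is_legacy_list(three_level))  # False
print(is_legacy_list(two_level))    # True
```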





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176234693

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * c814458b48d8a33a2b5ebbb0355183e129be89f4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24449)
 
   
   



Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176195199

   
   ## CI report:
   
   * 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
 
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176195058

   
   ## CI report:
   
   * d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
 
   * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176194891

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
 
   * c814458b48d8a33a2b5ebbb0355183e129be89f4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24449)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176093947

   
   ## CI report:
   
   * 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
 
   * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176093552

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
 
   * c814458b48d8a33a2b5ebbb0355183e129be89f4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11467:
URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176094037

   
   ## CI report:
   
   * df180b77664cb45e434dec4982d31ff70e3dac3c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
 
   * a7f2ee34e98381ad9afa7c6dfa634aace8b3546b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24450)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176093779

   
   ## CI report:
   
   * d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
 
   * 16afb3d821f1fd35beff26f697016826bcf55491 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11467:
URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176075195

   
   ## CI report:
   
   * df180b77664cb45e434dec4982d31ff70e3dac3c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
 
   * a7f2ee34e98381ad9afa7c6dfa634aace8b3546b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]

2024-06-18 Thread via GitHub


soumilshah1995 commented on issue #11469:
URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176057015

   For EMR, here is a guide to follow:
   
   https://youtu.be/jvbHUl9A4tQ?si=l7AdUR4vmr_5sDIq
   
   
   Running Apache Hudi DeltaStreamer on EMR Serverless: a hands-on lab and step-by-step guide for beginners
   
   
![1](https://user-images.githubusercontent.com/39345855/229940404-f3efeaae-6e5b-446b-a229-b1fb86e4ea2b.JPG)
   
   ## Video-based guide
   * https://www.youtube.com/watch?v=jvbHUl9A4tQ&feature=youtu.be

   
   # Steps 
   ## Step 1: Download the sample Parquet files from the links
   * https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
    Upload them to the S3 folder as shown in the diagram
   
![image](https://user-images.githubusercontent.com/39345855/229939875-6f2f22ae-c792-4904-bcf8-b1e53ce1e122.png)
   
   
   
   ## Step 2: Start EMR Serverless Cluster 
   
![image](https://user-images.githubusercontent.com/39345855/229940052-29f6e2a8-9568-4100-8a1b-e988c405f505.png)
   
![image](https://user-images.githubusercontent.com/39345855/229940099-cf002f04-18f8-4d26-8d89-d512e96bef76.png)
   
![image](https://user-images.githubusercontent.com/39345855/229940131-836414cf-a85f-4b9f-b1d6-c36115d335c2.png)
   
   ## Step 3: Run Python code to submit the job
   * Please change and edit the variables
   
   ```
   try:
       import json
       import uuid
       import os
       import boto3
       from dotenv import load_dotenv

       load_dotenv(".env")
   except Exception as e:
       pass

   global AWS_ACCESS_KEY
   global AWS_SECRET_KEY
   global AWS_REGION_NAME

   AWS_ACCESS_KEY = os.getenv("DEV_ACCESS_KEY")
   AWS_SECRET_KEY = os.getenv("DEV_SECRET_KEY")
   AWS_REGION_NAME = os.getenv("DEV_REGION")

   client = boto3.client("emr-serverless",
                         aws_access_key_id=AWS_ACCESS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         region_name=AWS_REGION_NAME)


   def lambda_handler_test_emr(event, context):
       # -------------------- Hudi settings --------------------
       glue_db = "hudi_db"
       table_name = "invoice"
       op = "UPSERT"
       table_type = "COPY_ON_WRITE"

       record_key = 'invoiceid'
       precombine = "replicadmstimestamp"
       partition_field = 'destinationstate'
       source_ordering_field = 'replicadmstimestamp'

       delta_streamer_source = 's3:///raw'
       hudi_target_path = 's3://X/hudi'

       # --------------------
       #   EMR
       # --------------------
       ApplicationId = "XXX"
       ExecutionTime = 600
       ExecutionArn = "XX"
       JobName = 'delta_streamer_{}'.format(table_name)

       # --------------------

       spark_submit_parameters = ' --conf spark.jars=/usr/lib/hudi/hudi-utilities-bundle.jar'
       spark_submit_parameters += ' --conf spark.serializer=org.apache.spark.serializer.KryoSerializer'
       spark_submit_parameters += ' --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
       spark_submit_parameters += ' --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
       spark_submit_parameters += ' --conf spark.sql.hive.convertMetastoreParquet=false'
       spark_submit_parameters += ' --conf mapreduce.fileoutputcommitter.marksuccessfuljobs=false'
       spark_submit_parameters += ' --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
       spark_submit_parameters += ' --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer'

       arguments = [
           "--table-type", table_type,
           "--op", op,
           "--enable-sync",
           "--source-ordering-field", source_ordering_field,
           "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
           "--target-table", table_name,
           "--target-base-path", hudi_target_path,
           "--payload-class", "org.apache.hudi.common.model.AWSDmsAvroPayload",
           "--hoodie-conf", "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator",
           "--hoodie-conf", "hoodie.datasource.write.recordkey.field={}".format(record_key),
           "--hoodie-conf", "hoodie.datasource.write.partitionpath.field={}".format(partition_field),
           "--hoodie-conf", "hoodie.deltastreamer.source.dfs.root={}".format(delta_streamer_source),
           "--hoodie-conf", "hoodie.datasource.write.precombine.field={}".format(precombine),
           "--hoodie-conf", "hoodie.database.name={}".format(glue_db),
           "--

Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176056872

   
   ## CI report:
   
   * 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Data deduplication caused by drawback in the delete invalid files before commit [hudi]

2024-06-18 Thread via GitHub


nsivabalan commented on issue #11419:
URL: https://github.com/apache/hudi/issues/11419#issuecomment-2176054381

   Is the main reason that different file system schemes treat file-not-found differently during `fs.delete()`? And are you proposing `HoodieStorage#deleteFile` to unify that?
   
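
   To make the concern concrete, here is a minimal sketch (hypothetical Python, not Hudi's actual `HoodieStorage` code) of unifying delete semantics across schemes that signal a missing file by returning `False` versus by raising:

   ```python
   def delete_file(fs, path) -> bool:
       """Normalize delete semantics: treat 'file not found' as a successful no-op,
       whether the underlying scheme signals it by a falsy return or by raising."""
       try:
           return fs.delete(path)   # some schemes simply return False for an absent path
       except FileNotFoundError:
           return True              # others raise; treat the file as already deleted

   class FakeRaisingFS:
       """Toy file system that raises on deleting a missing path."""
       def __init__(self, files):
           self.files = set(files)
       def delete(self, path):
           if path not in self.files:
               raise FileNotFoundError(path)
           self.files.remove(path)
           return True

   fs = FakeRaisingFS({"a.parquet"})
   print(delete_file(fs, "a.parquet"))    # True
   print(delete_file(fs, "missing.parquet"))  # True (treated as a no-op)
   ```

   A wrapper like this lets the invalid-file cleanup before commit ignore races where another writer already removed the file, regardless of the storage scheme.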
   





Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11467:
URL: https://github.com/apache/hudi/pull/11467#issuecomment-2175969354

   
   ## CI report:
   
   * df180b77664cb45e434dec4982d31ff70e3dac3c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]

2024-06-18 Thread via GitHub


ROOBALJINDAL opened a new issue, #11469:
URL: https://github.com/apache/hudi/issues/11469

   **Describe the problem you faced**
   We are creating empty Hudi tables from Java as follows:
   ```
   Dataset<Row> emptyDF = spark.createDataFrame(new ArrayList<Row>(), schemaStruct);
   emptyDF.write()
   .format("org.apache.hudi")
   .options(tableConf.getHudiOptions())
   .mode(SaveMode.Append)
   .save();
   ```
   Spark conf:
   ```
   entryPoint: /hudi/hudi-addon-edfx.jar
   sparkParamsArguments = ["--class 
com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi",
   "--conf spark.jars=/usr/lib/hudi/hudi-utilities-bundle.jar",
   "--conf spark.executor.instances=0",
   "--conf spark.executor.memory=4g",
   "--conf spark.driver.memory=4g",
   "--conf spark.driver.cores=4",
   "--conf spark.dynamicAllocation.initialExecutors=1"
   ```
   This used to work fine but suddenly stopped working after Hudi was upgraded from 0.13.1 to 0.14.0 (EMR was upgraded from 6.12 to 6.15).
   
   I referred to a similar issue: https://github.com/apache/hudi/issues/2997
   I also added hudi-spark3-bundle_2.12-0.14.0.jar to spark.jars, but it didn't work. I don't know why it is not able to find this class.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * AWS EMR version : 6.15
   
   
   **Stacktrace**
   
   ```
   24/06/18 12:02:18 ERROR PreCreateEmptyTablesInHudi: Exception encountered while generating table ehcpencountererror :
   org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: org.apache.hudi. Please find packages at `https://spark.apache.org/third-party-projects.html`.
       at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:739) ~[spark-catalyst_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:860) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi.lambda$main$0(PreCreateEmptyTablesInHudi.java:170) ~[?:?]
       at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
       at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
       at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
       at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
       at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754) ~[?:?]
       at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) ~[?:?]
       at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) ~[?:?]
       at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) ~[?:?]
       at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) ~[?:?]
       at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) ~[?:?]
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource
       at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) ~[?:?]
       at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) ~[?:?]
       at java.lang.ClassLoader.loadClass(ClassLoader.java:525) ~[?:?]
       at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
       at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       at scala.util.Failure.orElse(Try.scala:224) ~[scala-library-2.12.15.jar:?]
       at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
       ... 15 more
   ```
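
   A common first check for this error is whether the Hudi Spark bundle is actually on the job's classpath; below is a sketch of attaching it explicitly at submit time (the `/usr/lib/hudi` jar path and the `app.jar`/main-class values are illustrative placeholders, adjust to your EMR layout):

   ```shell
   # Attach the Hudi Spark bundle matching your Spark version (Spark 3.4 on EMR 6.15),
   # either from the jar EMR ships:
   spark-submit \
     --class com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar \
     app.jar

   # ...or pulled from Maven Central:
   spark-submit \
     --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
     --class com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi \
     app.jar
   ```

   Note that `hudi-utilities-bundle.jar` alone does not provide the `org.apache.hudi` data source for `DataFrameWriter.format("org.apache.hudi")`; the Spark bundle does.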
   
   



Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11467:
URL: https://github.com/apache/hudi/pull/11467#issuecomment-2175953221

   
   ## CI report:
   
   * df180b77664cb45e434dec4982d31ff70e3dac3c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] SqlQueryBasedTransformer new field issue with PostgresDebeziumSource [hudi]

2024-06-18 Thread via GitHub


soumilshah1995 commented on issue #11468:
URL: https://github.com/apache/hudi/issues/11468#issuecomment-2175922997

   Slack chat thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1718691646054409
   
   This issue arises when attempting to use the transformer with:
   
   --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf 'hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC>'
   
   It throws an issue even when using SELECT *.
   
   





[I] [SUPPORT] [hudi]

2024-06-18 Thread via GitHub


ashwinagalcha-ps opened a new issue, #11468:
URL: https://github.com/apache/hudi/issues/11468

   When using Kafka + Debezium + Streamer, we are able to write data and the job works fine. When using the SqlQueryBasedTransformer, however, it is able to write data to S3 with the new field, but ultimately the job fails.
   
   Below are the Hudi Deltastreamer job configs:
   
   ```
   "--table-type", "COPY_ON_WRITE",
   "--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource",
   "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
   "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT *, extract(year from a.created_at) as year FROM <SRC> a",
   "--source-ordering-field", output["source_ordering_field"],
   "--target-base-path", f"s3a://{env_params['deltastreamer_bucket']}/{db_name}/{schema}/{output['table_name']}/",
   "--target-table", output["table_name"],
   "--auto.offset.reset=earliest",
   "--props", properties_file,
   "--payload-class", "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload",
   "--enable-hive-sync",
   "--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms",
   "--hoodie-conf", "hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true",
   "--hoodie-conf", f"hoodie.deltastreamer.source.kafka.topic={connector_name}.{schema}.{output['table_name']}",
   "--hoodie-conf", f"schema.registry.url={env_params['schema_registry_url']}",
   "--hoodie-conf", f"hoodie.deltastreamer.schemaprovider.registry.url={env_params['schema_registry_url']}/subjects/{connector_name}.{schema}.{output['table_name']}-value/versions/latest",
   "--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer",
   "--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false",
   "--hoodie-conf", f"hoodie.datasource.hive_sync.database={output['hive_database']}",
   "--hoodie-conf", f"hoodie.datasource.hive_sync.table={output['table_name']}",
   "--hoodie-conf", "hoodie.datasource.hive_sync.metastore.uris=",
   "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",
   "--hoodie-conf", "hoodie.datasource.hive_sync.support_timestamp=true",
   "--hoodie-conf", "hoodie.deltastreamer.source.kafka.maxEvents=10",
   "--hoodie-conf", f"hoodie.datasource.write.recordkey.field={output['record_key']}",
   "--hoodie-conf", f"hoodie.datasource.write.precombine.field={output['precombine_field']}",
   "--hoodie-conf", f"hoodie.datasource.hive_sync.glue_database={output['hive_database']}",
   "--continuous"
   ```
   
   Properties file:
   ```
   bootstrap.servers=
   auto.offset.reset=earliest
   schema.registry.url=http://host:8081
   ```
   
   **Expected behavior**: To be able to extract a new field (year) in the 
target hudi table with the help of SqlQueryBasedTransformer.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   * Base image & jars:
   
`public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21`
   
   
`https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar
   
https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.0/hudi-utilities-bundle_2.12-0.14.0.jar`
   
   **Stacktrace**
   
   ```
   2024-06-14T14:16:17.562738557Z 24/06/14 14:16:17 ERROR HoodieStreamer: Shutting down delta-sync due to exception
   2024-06-14T14:16:17.562785897Z org.apache.hudi.utilities.exception.HoodieTransformExecutionException: Failed to apply sql query based transformer
   2024-06-14T14:16:17.562797467Z   at org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:68)
   2024-06-14T14:16:17.562805097Z   at org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105)
   2024-06-14T14:16:17.562812197Z   at org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530)
   2024-06-14T14:16:17.562819517Z   at org.apache.hudi.common.util.Option.map(Option.java:108)
   2024-06-14T14:16:17.562826327Z   at org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530)
   2024-06-14T14:16:17.562836838Z   at org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495)
   2024-06-14T14:16:17.562844648Z   at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405)
   2024-06-14T14:16:17.562852958Z   at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757)
   2024-06-14T14:16:17.562860358Z   at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
   2024-06-14T14:16:17.562868059Z   at j
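
   For reference, `SqlQueryBasedTransformer` substitutes the incoming batch for the `<SRC>` placeholder in the query, so the transformer SQL is normally written against it (illustrative fragment using the option names from this thread):

   ```
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
   --hoodie-conf "hoodie.streamer.transformer.sql=SELECT a.*, extract(year from a.created_at) AS year FROM <SRC> a"
   ```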

[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7877:
-
Labels: pull-request-available  (was: )

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in the index lookup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7877] Add record position to record index metadata payload [hudi]

2024-06-18 Thread via GitHub


lokeshj1703 opened a new pull request, #11467:
URL: https://github.com/apache/hudi/pull/11467

   ### Change Logs
   
   RLI should save the record position so that it can be used in the index lookup. This PR adds a position field to the RLI metadata payload to track it.
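   
   Conceptually, the record-level index maps a record key to the file that holds it; tracking the record's position within that file additionally lets a lookup seek directly instead of scanning the file. A minimal sketch of the idea (names are illustrative, not Hudi's actual payload schema):
   
   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class RecordIndexEntry:
       """Illustrative record-level-index entry: record key -> file location (+ position)."""
       partition: str
       file_id: str
       instant_time: str
       position: int  # new: row position inside the base file, enabling a direct positional read

   # Toy index: one key mapped to its location.
   index = {
       "key-001": RecordIndexEntry("2024/06/18", "file-group-1", "20240618120000", 42),
   }

   entry = index["key-001"]
   print(entry.position)  # 42 -> can drive a positional read of the base file
   ```
   
   With the position stored, the reader can skip the key-by-key scan of the matched file slice during index lookup.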
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175829345

   
   ## CI report:
   
   * d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175829244

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175726334

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * 93ddf4151fedb0698e0c5c56d69b4b866626d393 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24440)
 
   * 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175726559

   
   ## CI report:
   
   * 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
 
   * 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   * 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175726457

   
   ## CI report:
   
   * d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
 
   
   



Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175709698

   
   ## CI report:
   
   * 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
   * 93ddf4151fedb0698e0c5c56d69b4b866626d393 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24440)
 
   * 0dbffbf92f2cb18861621be3e216e65a03129cf6 UNKNOWN
   
   



Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175710030

   
   ## CI report:
   
   * 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
 
   * 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   * 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 UNKNOWN
   
   



Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11461:
URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175709905

   
   ## CI report:
   
   * d538bb2c8d4ba5a8da23034338b080f33d132888 UNKNOWN
   
   



[jira] [Commented] (HUDI-4096) Sync timeline from embedded timeline server in flink pipline

2024-06-18 Thread Qijun Fu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855845#comment-17855845
 ] 

Qijun Fu commented on HUDI-4096:


I think this [PR|https://github.com/apache/hudi/pull/9651] has already fixed this?

> Sync timeline from embedded timeline server in flink pipline
> 
>
> Key: HUDI-4096
> URL: https://issues.apache.org/jira/browse/HUDI-4096
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Major
>
> At present, in the Flink-Hudi pipeline, each task will scan the meta 
> directory to obtain the latest timeline, which will cause frequent get 
> listing operations on HDFS and cause a lot of pressure.
> A proposal is we can modify the way to get the timeline in the Flink-Hudi 
> pipeline and pull the active timeline through the embedded timeline server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7892] Building workload support set parallelism [hudi]

2024-06-18 Thread via GitHub


danny0405 commented on code in PR #11463:
URL: https://github.com/apache/hudi/pull/11463#discussion_r1644107676


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -157,6 +157,9 @@ public HoodieWriteMetadata> execute(HoodieData> inputRecordsWithClusteringUpdate = clusteringHandleUpdate(inputRecords);
+if (config.getBuildWorkloadParallelism() > 0) {
+  inputRecordsWithClusteringUpdate = inputRecordsWithClusteringUpdate.repartition(config.getBuildWorkloadParallelism());

Review Comment:
   > This is an existing issue in other logic as well; adding a new parameter is not friendly to the user. I have a PR that optimizes the inference of partition counts, which can also solve your problem.
   
   This would be nice; the fix looks promising.






Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175617946

   
   ## CI report:
   
   * 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
 
   * 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   



Re: [I] Fail to add default partition [hudi]

2024-06-18 Thread via GitHub


danny0405 commented on issue #10154:
URL: https://github.com/apache/hudi/issues/10154#issuecomment-2175570868

   Thanks for the feedback @CaesarWangX. Did you try HMS as the sync mode? The 1st issue is unexpected and should be a bug: the motive was to keep the default partition name in sync with Hive, but it now causes problems reported by Hive.
   
   For the 2nd, there may be no easy way to stay compatible with historical data, because the partition path is a hotspot code path and we cannot weigh the ramifications for historical values on a per-record basis. If you use Flink for ingestion, there is a config option named `partition.default_name` to switch to another default value as needed.
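
   For readers hitting the same issue, the workaround above can be sketched as follows. The option key `partition.default_name` comes from the comment; the helper method, map, and the value `"default_partition"` are illustrative assumptions, not Hudi defaults:
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   public class DefaultPartitionOption {
       // Builds a hypothetical Flink writer option map that overrides the
       // default partition name via `partition.default_name` (key taken from
       // the comment above). In a real pipeline these options would be passed
       // to the Hudi Flink sink configuration.
       static Map<String, String> writerOptions(String defaultPartition) {
           Map<String, String> options = new HashMap<>();
           // Records whose partition field is null/empty land in this partition.
           options.put("partition.default_name", defaultPartition);
           return options;
       }
   
       public static void main(String[] args) {
           System.out.println(writerOptions("default_partition").get("partition.default_name"));
       }
   }
   ```
   Running it prints the configured name, `default_partition`.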





Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]

2024-06-18 Thread via GitHub


VitoMakarevich commented on PR #11450:
URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175605541

   Also I see that this block
   ```
   if (!legacyMode) {
     requestedSchema = new MessageType(requestedSchema.getName(), convertLegacyMap(requestedSchema.getFields()));
   }
   ```
   is redundant: in every case, the `requestedSchema` fetched after tuning `AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE` equals the schema coming from this block, meaning `AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE` already performs the correct conversion.
   So this code block should be removed to avoid confusion.
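
   For context on what "legacy mode" means here, the two Parquet list encodings at stake can be written out side by side. The schema strings below are a minimal sketch of the layouts (field names `tags`, `array`, `list`, and `element` follow common Parquet conventions, not any specific Hudi file):
   ```java
   public class ParquetListLayouts {
       // Legacy 2-level layout, as written by parquet-avro when
       // AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE is true: the repeated
       // field directly carries the element type.
       static final String TWO_LEVEL =
             "optional group tags (LIST) {\n"
           + "  repeated binary array (UTF8);\n"
           + "}";
   
       // Standard 3-level layout: a repeated inner group named `list`
       // wraps an optional `element`, which allows null list entries.
       static final String THREE_LEVEL =
             "optional group tags (LIST) {\n"
           + "  repeated group list {\n"
           + "    optional binary element (UTF8);\n"
           + "  }\n"
           + "}";
   
       public static void main(String[] args) {
           System.out.println(TWO_LEVEL);
           System.out.println(THREE_LEVEL);
       }
   }
   ```
   A reader that only understands one layout misparses files written in the other, which is the interoperability problem this PR addresses.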





Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11465:
URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175600718

   
   ## CI report:
   
   * 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
 
   * 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 UNKNOWN
   
   



Re: [I] [SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error [hudi]

2024-06-18 Thread via GitHub


danny0405 commented on issue #11466:
URL: https://github.com/apache/hudi/issues/11466#issuecomment-2175579240

   I see you set the option `options.put(FlinkOptions.IGNORE_FAILED.key(), "false");`. It looks like there is an error in the parquet writers that is collected back to the coordinator, so the commit reports the error.





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() // to branch-0.x [hudi]

2024-06-18 Thread via GitHub


hudi-bot commented on PR #11437:
URL: https://github.com/apache/hudi/pull/11437#issuecomment-2175474243

   
   ## CI report:
   
   * 520894319e26fca9b1b28a513be1273aba13edb9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24443)
 
   
   



[jira] [Commented] (HUDI-6286) Overwrite mode should not delete old data

2024-06-18 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855820#comment-17855820
 ] 

Geser Dugarov commented on HUDI-6286:
-

Note that in HoodieWriteUtils.validateTableConfig() we skip all conflict checks between the new and existing table configurations when the save mode is Overwrite.
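
The skip behavior described above can be sketched as follows; `SaveMode` and the method shape are simplified stand-ins for the actual Spark/Hudi types, not the real HoodieWriteUtils signature:

```java
public class TableConfigValidation {
    enum SaveMode { APPEND, OVERWRITE, ERROR_IF_EXISTS }

    // Simplified sketch of the skip logic noted in the comment above:
    // conflict checks between the incoming and existing table configs are
    // bypassed entirely when the save mode is Overwrite.
    static boolean shouldValidateTableConfig(SaveMode mode) {
        return mode != SaveMode.OVERWRITE;
    }

    public static void main(String[] args) {
        System.out.println(shouldValidateTableConfig(SaveMode.APPEND));    // validated
        System.out.println(shouldValidateTableConfig(SaveMode.OVERWRITE)); // skipped
    }
}
```

The consequence is that an Overwrite write with conflicting table properties is never rejected up front, which interacts with the base-path deletion issue this ticket tracks.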

> Overwrite mode should not delete old data
> -
>
> Key: HUDI-6286
> URL: https://issues.apache.org/jira/browse/HUDI-6286
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
> Fix For: 1.1.0
>
>
> https://github.com/apache/hudi/pull/8076/files#r1127283648
> For *Overwrite* mode, we should not delete the basePath. Just overwrite the 
> existing data


