[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7844:
-
Labels: pull-request-available  (was: )

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR ([https://github.com/apache/hudi/pull/11162]) introduces changes 
> that make `HoodieSparkSqlTestBase` swallow test failures.
>  
> !Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7844] Fix HoodieSparkSqlTestBase to throw error upon test failure [hudi]

2024-06-07 Thread via GitHub


yihua opened a new pull request, #11416:
URL: https://github.com/apache/hudi/pull/11416

   ### Change Logs
   
   PR #11162 introduced changes that make `HoodieSparkSqlTestBase` swallow 
test failures.  This PR reverts those changes so that test failures are 
surfaced locally and in CI.
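   To make the failure mode concrete, here is a hedged, self-contained
   illustration (not Hudi's actual code; class and method names are
   hypothetical) of the difference between a test harness that swallows a
   failure and one that surfaces it:

   ```java
   // Hypothetical illustration (not Hudi's actual code): catching Throwable
   // around a test body and only logging it means assertion errors never
   // reach the test runner, so CI reports the test as passing.
   public class SwallowingExample {
       static boolean runSwallowing(Runnable testBody) {
           try {
               testBody.run();
               return true;
           } catch (Throwable t) {
               // Swallowed: the runner sees a "passing" test.
               System.err.println("ignored failure: " + t.getMessage());
               return false;
           }
       }

       static void runSurfacing(Runnable testBody) {
           // The fix: let the failure propagate so the runner marks the test failed.
           testBody.run();
       }

       public static void main(String[] args) {
           // The swallowing variant hides the assertion error entirely.
           boolean ok = runSwallowing(() -> { throw new AssertionError("boom"); });
           System.out.println(ok); // prints "false", but no test failure is reported

           // The surfacing variant propagates it to the caller.
           try {
               runSurfacing(() -> { throw new AssertionError("boom"); });
           } catch (AssertionError e) {
               System.out.println("surfaced: " + e.getMessage()); // prints "surfaced: boom"
           }
       }
   }
   ```

   Reverting to the surfacing behavior is what makes failures visible both
   locally and in CI.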
   
   ### Impact
   
   Makes sure test failures are surfaced in CI.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7844:
---

Assignee: Ethan Guo

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR ([https://github.com/apache/hudi/pull/11162]) introduces changes 
> that make `HoodieSparkSqlTestBase` swallow test failures.
>  
> !Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!
>  





[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7844:

Description: 
This PR ([https://github.com/apache/hudi/pull/11162]) introduces the following 
changes that makes 

!Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!

 

  was:
This PR (https://github.com/apache/hudi/pull/11162) introduces the following 
changes in 

 

 


> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR ([https://github.com/apache/hudi/pull/11162]) introduces the 
> following changes that makes 
> !Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!
>  





[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7844:

Description: 
This PR ([https://github.com/apache/hudi/pull/11162]) introduces changes that 
make `HoodieSparkSqlTestBase` swallow test failures.

 

!Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!

 

  was:
This PR ([https://github.com/apache/hudi/pull/11162]) introduces the following 
changes that makes 

!Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!

 


> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR ([https://github.com/apache/hudi/pull/11162]) introduces changes 
> that make `HoodieSparkSqlTestBase` swallow test failures.
>  
> !Screenshot 2024-06-07 at 22.27.21.png|width=873,height=397!
>  





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2155818031

   
   ## CI report:
   
   * a70cfc6db41a781bb3b6c9c8a9138892f7a12687 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24291)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7844:
---

 Summary: Fix HoodieSparkSqlTestBase to throw error upon test 
failure
 Key: HUDI-7844
 URL: https://issues.apache.org/jira/browse/HUDI-7844
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo
 Attachments: Screenshot 2024-06-07 at 22.27.21.png







[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7844:

Fix Version/s: 1.0.0

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR (https://github.com/apache/hudi/pull/11162) introduces the following 
> changes in 
>  
>  





[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7844:

Description: 
This PR (https://github.com/apache/hudi/pull/11162) introduces the following 
changes in 

 

 

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR (https://github.com/apache/hudi/pull/11162) introduces the following 
> changes in 
>  
>  





[jira] [Updated] (HUDI-7844) Fix HoodieSparkSqlTestBase to throw error upon test failure

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7844:

Attachment: Screenshot 2024-06-07 at 22.27.21.png

> Fix HoodieSparkSqlTestBase to throw error upon test failure
> ---
>
> Key: HUDI-7844
> URL: https://issues.apache.org/jira/browse/HUDI-7844
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Screenshot 2024-06-07 at 22.27.21.png
>
>
> This PR (https://github.com/apache/hudi/pull/11162) introduces the following 
> changes in 
>  
>  





[jira] [Updated] (HUDI-7843) Support record merge mode with partial updates

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7843:

Description: Right now, the new partial-update support, with partial updates 
stored in the log block, works in Spark SQL MERGE INTO, and the merging logic, 
based on either transaction time or event time, is fully handled inside 
"HoodieSparkRecordMerger".  It would be good to decouple the merging logic from 
the merger and use the record merge mode to control how to merge partial 
updates.  (was: Right now the new partial update support with partial updates 
stored in the log block works in Spark SQL MERGE INTO and we assume that )
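A hedged sketch of what merge-mode-driven partial-update merging could look 
like (all names are illustrative, not Hudi's API): under event-time ordering 
the record with the larger ordering value wins, while under commit-time 
ordering the later-arriving partial update always applies, and in either case 
only the fields present in the partial update change.

```java
// Illustrative sketch, not Hudi's actual merger: merging a partial update
// (only changed fields) into an older full record under two merge modes.
import java.util.HashMap;
import java.util.Map;

public class PartialMergeSketch {
    enum MergeMode { COMMIT_TIME_ORDERING, EVENT_TIME_ORDERING }

    // older: full record; update: only the changed fields; *Ts: ordering values
    static Map<String, Object> merge(MergeMode mode,
                                     Map<String, Object> older, long olderTs,
                                     Map<String, Object> update, long updateTs) {
        if (mode == MergeMode.EVENT_TIME_ORDERING && updateTs < olderTs) {
            return older; // stale (late-arriving) partial update is ignored
        }
        Map<String, Object> merged = new HashMap<>(older);
        merged.putAll(update); // only fields present in the partial update change
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> older = new HashMap<>();
        older.put("id", 1); older.put("price", 10); older.put("qty", 5);
        Map<String, Object> update = new HashMap<>();
        update.put("price", 12);

        // Event-time ordering ignores the update (its timestamp 90 < 100).
        System.out.println(merge(MergeMode.EVENT_TIME_ORDERING, older, 100, update, 90).get("price")); // prints 10
        // Commit-time ordering applies the later-arriving update regardless.
        System.out.println(merge(MergeMode.COMMIT_TIME_ORDERING, older, 100, update, 90).get("price")); // prints 12
    }
}
```

Decoupling this decision from the merger means the merge mode alone, rather 
than merger-specific code, selects between the two branches above.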

> Support record merge mode with partial updates
> --
>
> Key: HUDI-7843
> URL: https://issues.apache.org/jira/browse/HUDI-7843
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> Right now, the new partial-update support, with partial updates stored in 
> the log block, works in Spark SQL MERGE INTO, and the merging logic, based 
> on either transaction time or event time, is fully handled inside 
> "HoodieSparkRecordMerger".  It would be good to decouple the merging logic 
> from the merger and use the record merge mode to control how to merge 
> partial updates.





[jira] [Updated] (HUDI-7843) Support record merge mode with partial updates

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7843:

Fix Version/s: 1.0.0

> Support record merge mode with partial updates
> --
>
> Key: HUDI-7843
> URL: https://issues.apache.org/jira/browse/HUDI-7843
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> Right now the new partial update support with partial updates stored in the 
> log block works in Spark SQL MERGE INTO and we assume that 





[jira] [Updated] (HUDI-7843) Support record merge mode with partial updates

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7843:

Description: Right now the new partial update support with partial updates 
stored in the log block works in Spark SQL MERGE INTO and we assume that 

> Support record merge mode with partial updates
> --
>
> Key: HUDI-7843
> URL: https://issues.apache.org/jira/browse/HUDI-7843
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
>
> Right now the new partial update support with partial updates stored in the 
> log block works in Spark SQL MERGE INTO and we assume that 





[jira] [Created] (HUDI-7843) Support record merge mode with partial updates

2024-06-07 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7843:
---

 Summary: Support record merge mode with partial updates
 Key: HUDI-7843
 URL: https://issues.apache.org/jira/browse/HUDI-7843
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo








Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2155806588

   
   ## CI report:
   
   * c2dec94b442920784b3914cc13b87294e734a477 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24272)
 
   * a70cfc6db41a781bb3b6c9c8a9138892f7a12687 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24291)
 
   
   
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2155804768

   
   ## CI report:
   
   * c2dec94b442920784b3914cc13b87294e734a477 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24272)
 
   * a70cfc6db41a781bb3b6c9c8a9138892f7a12687 UNKNOWN
   
   
   





[jira] [Updated] (HUDI-7842) Update docs with the new record merge mode config

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7842:

Fix Version/s: 1.0.0

> Update docs with the new record merge mode config
> -
>
> Key: HUDI-7842
> URL: https://issues.apache.org/jira/browse/HUDI-7842
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> We should educate users on the new record merge mode config introduced by 
> HUDI-6798 that simplifies configs controlling the merging behavior.





[jira] [Updated] (HUDI-7842) Update docs with the new record merge mode config

2024-06-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7842:

Description: We should educate users on the new record merge mode config 
introduced by HUDI-6798 that simplifies configs controlling the merging 
behavior.

> Update docs with the new record merge mode config
> -
>
> Key: HUDI-7842
> URL: https://issues.apache.org/jira/browse/HUDI-7842
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Priority: Major
>
> We should educate users on the new record merge mode config introduced by 
> HUDI-6798 that simplifies configs controlling the merging behavior.





[jira] [Created] (HUDI-7842) Update docs with the new record merge mode config

2024-06-07 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7842:
---

 Summary: Update docs with the new record merge mode config
 Key: HUDI-7842
 URL: https://issues.apache.org/jira/browse/HUDI-7842
 Project: Apache Hudi
  Issue Type: Task
Reporter: Ethan Guo








(hudi) branch branch-0.x updated: [MINOR] use scala.math.abs instead of calcite abs (#11412)

2024-06-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new 041964ab711 [MINOR] use scala.math.abs instead of calcite abs (#11412)
041964ab711 is described below

commit 041964ab71175739134c252bef870693d8bba14e
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Fri Jun 7 19:51:53 2024 -0700

[MINOR] use scala.math.abs instead of calcite abs (#11412)

Co-authored-by: Shawn Chang 
---
 .../scala/org/apache/hudi/functional/TestParquetColumnProjection.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestParquetColumnProjection.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestParquetColumnProjection.scala
index 0173c3f642a..c256cf32fb3 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestParquetColumnProjection.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestParquetColumnProjection.scala
@@ -29,7 +29,6 @@ import 
org.apache.hudi.testutils.SparkClientFunctionalTestHarness.getSparkSqlCon
 import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, 
DefaultSource, HoodieBaseRelation, HoodieSparkUtils, HoodieUnsafeRDD}
 
 import org.apache.avro.Schema
-import org.apache.calcite.runtime.SqlFunctions.abs
 import org.apache.parquet.hadoop.util.counters.BenchmarkCounter
 import org.apache.spark.SparkConf
 import org.apache.spark.internal.Logging
@@ -39,6 +38,7 @@ import org.junit.jupiter.api.Assertions.{assertEquals, 
assertFalse, assertTrue,
 import org.junit.jupiter.api.{Disabled, Tag, Test}
 
 import scala.collection.JavaConverters._
+import scala.math.abs
 
 @Tag("functional")
 class TestParquetColumnProjection extends SparkClientFunctionalTestHarness 
with Logging {



Re: [PR] [MINOR][branch-0.x] Remove calcite dependency [hudi]

2024-06-07 Thread via GitHub


danny0405 merged PR #11412:
URL: https://github.com/apache/hudi/pull/11412





(hudi) branch master updated: [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version (#11406)

2024-06-07 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new a33b2a5e03f [HUDI-7834] Create placeholder table versions and 
introduce new hoodie table property to track initial table version (#11406)
a33b2a5e03f is described below

commit a33b2a5e03f434e3ce270a128be626ae9e9e78c9
Author: Balaji Varadarajan 
AuthorDate: Fri Jun 7 18:43:33 2024 -0700

[HUDI-7834] Create placeholder table versions and introduce new hoodie 
table property to track initial table version (#11406)

Co-authored-by: Balaji Varadarajan 
Co-authored-by: Y Ethan Guo 
---
 .../apache/hudi/cli/commands/RepairsCommand.java   |  4 +++
 .../upgrade/EightToSevenDowngradeHandler.java  | 37 +++
 .../table/upgrade/SevenToEightUpgradeHandler.java  | 38 
 .../table/upgrade/SevenToSixDowngradeHandler.java  | 40 +
 .../table/upgrade/SixToSevenUpgradeHandler.java| 42 ++
 .../hudi/table/upgrade/UpgradeDowngrade.java   |  8 +
 .../hudi/common/table/HoodieTableConfig.java   | 16 +
 .../hudi/common/table/HoodieTableMetaClient.java   |  3 ++
 .../hudi/common/table/HoodieTableVersion.java  | 10 --
 .../common/table/TestHoodieTableMetaClient.java|  1 +
 .../RepairOverwriteHoodiePropsProcedure.scala  |  5 ++-
 .../sql/hudi/procedure/TestRepairsProcedure.scala  |  1 +
 .../TestUpgradeOrDowngradeProcedure.scala  |  4 +--
 13 files changed, 203 insertions(+), 6 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
index 57ec8ccf57b..569136e0b50 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
@@ -161,6 +161,10 @@ public class RepairsCommand {
   newProps.load(fileInputStream);
 }
 Map oldProps = client.getTableConfig().propsMap();
+// Copy Initial Version from old-props to new-props
+if (oldProps.containsKey(HoodieTableConfig.INITIAL_VERSION.key())) {
+  newProps.put(HoodieTableConfig.INITIAL_VERSION.key(), 
oldProps.get(HoodieTableConfig.INITIAL_VERSION.key()));
+}
 HoodieTableConfig.create(client.getStorage(), client.getMetaPath(), 
newProps);
 // reload new props as checksum would have been added
 newProps =
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/EightToSevenDowngradeHandler.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/EightToSevenDowngradeHandler.java
new file mode 100644
index 000..3bb22481681
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/EightToSevenDowngradeHandler.java
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Version 7 is going to be placeholder version for bridge release 0.16.0.
+ * Version 8 is the placeholder version to track 1.x.
+ */
+public class EightToSevenDowngradeHandler implements DowngradeHandler {
+  @Override
+  public Map downgrade(HoodieWriteConfig config, 
HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade 
upgradeDowngradeHelper) {
+return Collections.emptyMap();
+  }
+}
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToEightUpgradeHandler.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToEightUpgradeHandler.java
new file mode 100644
index 000..9ed4f192786
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToEightUpgradeHandler.java
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the 
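The RepairsCommand hunk above carries the initial table version forward when 
`hoodie.properties` is overwritten during repair. A minimal standalone sketch 
of that copy (the property key name is taken to be illustrative; the helper 
name is hypothetical, and only plain `java.util.Properties` is used here):

```java
// Sketch of the property-preserving copy from the RepairsCommand hunk: when
// rewriting table properties, carry the initial table version forward if the
// old properties define it, so repair does not lose the table's origin version.
import java.util.Properties;

public class InitialVersionCopy {
    // Illustrative key name; the real key comes from HoodieTableConfig.INITIAL_VERSION.
    static final String INITIAL_VERSION_KEY = "hoodie.table.initial.version";

    static Properties preserveInitialVersion(Properties oldProps, Properties newProps) {
        if (oldProps.containsKey(INITIAL_VERSION_KEY)) {
            // Copy the value from the old props into the freshly loaded new props.
            newProps.setProperty(INITIAL_VERSION_KEY, oldProps.getProperty(INITIAL_VERSION_KEY));
        }
        return newProps;
    }

    public static void main(String[] args) {
        Properties oldProps = new Properties();
        oldProps.setProperty(INITIAL_VERSION_KEY, "6");
        Properties newProps = new Properties(); // e.g. loaded from a user-supplied file
        preserveInitialVersion(oldProps, newProps);
        System.out.println(newProps.getProperty(INITIAL_VERSION_KEY)); // prints "6"
    }
}
```

This mirrors the upgrade/downgrade handlers in the same commit, which treat 
versions 7 and 8 as placeholders while the initial-version property tracks 
where a table started.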

Re: [PR] [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version [hudi]

2024-06-07 Thread via GitHub


yihua merged PR #11406:
URL: https://github.com/apache/hudi/pull/11406





Re: [PR] [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155749354

   
   ## CI report:
   
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   * 5ff06d980691f473a959e149377e7aa14eaf7a55 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24287)
 
   * 5ff06d9806 UNKNOWN
   
   
   





Re: [PR] [MINOR][branch-0.x] Remove calcite dependency [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11412:
URL: https://github.com/apache/hudi/pull/11412#issuecomment-2155747004

   
   ## CI report:
   
   * f6f9c59cde7928b625332163d3118630fb199c27 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24289)
 
   
   
   





Re: [PR] [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version [hudi]

2024-06-07 Thread via GitHub


yihua commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155725862

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/dfb773d1-1753-4bcd-9f05-27037985bf0a
   





Re: [PR] [MINOR][branch-0.x] Remove calcite dependency [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11412:
URL: https://github.com/apache/hudi/pull/11412#issuecomment-2155725721

   
   ## CI report:
   
   * 22f518dc886318f5e5af58765436b353a45c0f21 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24260)
 
   * f6f9c59cde7928b625332163d3118630fb199c27 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24289)
 
   
   
   





Re: [PR] [MINOR][branch-0.x] Remove calcite dependency [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11412:
URL: https://github.com/apache/hudi/pull/11412#issuecomment-2155722554

   
   ## CI report:
   
   * 22f518dc886318f5e5af58765436b353a45c0f21 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24260)
 
   * f6f9c59cde7928b625332163d3118630fb199c27 UNKNOWN
   
   
   





(hudi) branch master updated: [HUDI-7840] Add position merging to the new file group reader (#11413)

2024-06-07 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d7cc87d687 [HUDI-7840] Add position merging to the new file group 
reader (#11413)
0d7cc87d687 is described below

commit 0d7cc87d687bd235bac099e481535cb9f223b501
Author: Jon Vexler 
AuthorDate: Fri Jun 7 19:47:42 2024 -0400

[HUDI-7840] Add position merging to the new file group reader (#11413)

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: Sagar Sumit 
---
 .../SparkFileFormatInternalRowReaderContext.scala  | 202 +++
 .../hudi/common/engine/HoodieReaderContext.java|  22 +++
 .../common/table/read/HoodieFileGroupReader.java   |   6 +-
 .../HoodiePositionBasedFileGroupRecordBuffer.java  |   4 +-
 .../read/HoodiePositionBasedSchemaHandler.java |  75 
 ...odieFileGroupReaderBasedParquetFileFormat.scala |   2 +-
 ...stSparkFileFormatInternalRowReaderContext.scala |  72 +++
 ...stHoodiePositionBasedFileGroupRecordBuffer.java | 214 +
 .../functional/TestFiltersInFileGroupReader.java   | 109 +++
 .../read/TestHoodieFileGroupReaderOnSpark.scala|   2 +-
 .../TestSpark35RecordPositionMetadataColumn.scala  | 143 ++
 11 files changed, 812 insertions(+), 39 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala
index 640f1219fbf..715e2d9a9ab 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala
@@ -22,10 +22,14 @@ package org.apache.hudi
 import org.apache.avro.Schema
 import org.apache.avro.generic.IndexedRecord
 import org.apache.hadoop.conf.Configuration
+import org.apache.hudi.SparkFileFormatInternalRowReaderContext.{filterIsSafeForBootstrap, getAppliedRequiredSchema}
+import org.apache.hudi.avro.AvroSchemaUtils
 import org.apache.hudi.common.engine.HoodieReaderContext
 import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.table.read.HoodiePositionBasedFileGroupRecordBuffer.ROW_INDEX_TEMPORARY_COLUMN_NAME
 import org.apache.hudi.common.util.ValidationUtils.checkState
-import org.apache.hudi.common.util.collection.{ClosableIterator, CloseableMappingIterator}
+import org.apache.hudi.common.util.collection.{CachingIterator, ClosableIterator, CloseableMappingIterator}
 import org.apache.hudi.io.storage.{HoodieSparkFileReaderFactory, HoodieSparkParquetReader}
 import org.apache.hudi.storage.{HoodieStorage, StorageConfiguration, StoragePath}
 import org.apache.hudi.util.CloseableInternalRowIterator
@@ -37,7 +41,7 @@ import org.apache.spark.sql.execution.datasources.PartitionedFile
 import org.apache.spark.sql.execution.datasources.parquet.{ParquetFileFormat, SparkParquetReader}
 import org.apache.spark.sql.hudi.SparkAdapter
 import org.apache.spark.sql.sources.Filter
-import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.types.{LongType, MetadataBuilder, StructField, StructType}
 import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}
 
 import scala.collection.mutable
@@ -53,12 +57,20 @@ import scala.collection.mutable
  *                          not required for reading a file group with only log files.
  * @param recordKeyColumn   column name for the recordkey
  * @param filters           spark filters that might be pushed down into the reader
+ * @param requiredFilters   filters that are required and should always be used, even in merging situations
  */
 class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetReader,
                                               recordKeyColumn: String,
-                                              filters: Seq[Filter]) extends BaseSparkInternalRowReaderContext {
+                                              filters: Seq[Filter],
+                                              requiredFilters: Seq[Filter]) extends BaseSparkInternalRowReaderContext {
   lazy val sparkAdapter: SparkAdapter = SparkAdapterSupport.sparkAdapter
+  private lazy val bootstrapSafeFilters: Seq[Filter] = filters.filter(filterIsSafeForBootstrap) ++ requiredFilters
   private val deserializerMap: mutable.Map[Schema, HoodieAvroDeserializer] = mutable.Map()
+  private lazy val allFilters = filters ++ requiredFilters
+
+  override def supportsParquetRowIndex: Boolean = {
+    HoodieSparkUtils.gteqSpark3_5
+  }
 
   override def getFileRecordIterator(filePath: StoragePath,
                                      start: Long,
@@ -66,6 +78,10 @@ class 
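The `filterIsSafeForBootstrap` check referenced in the diff above guards the bootstrap read path, where a skeleton file (Hudi meta columns) and a data file are read by two paired iterators that must stay row-aligned. The intent can be sketched as follows; this is an illustrative sketch, not Hudi's actual API, and the column sets are assumptions:

```java
import java.util.List;
import java.util.Set;

public class BootstrapFilterSafety {
    // Assumed split: Hudi meta columns live in the skeleton file,
    // all other columns live in the bootstrap data file.
    static final Set<String> META_COLUMNS = Set.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // A filter can be pushed into one side of a bootstrap read only if every
    // column it references lives in that same file; a filter mixing meta and
    // data columns would let one iterator skip rows the other still emits,
    // breaking the row alignment the merge depends on.
    static boolean isSafeForBootstrap(List<String> referencedColumns) {
        boolean allMeta = META_COLUMNS.containsAll(referencedColumns);
        boolean noneMeta = referencedColumns.stream().noneMatch(META_COLUMNS::contains);
        return allMeta || noneMeta;
    }
}
```

This is also why the diff keeps `requiredFilters` separate from pushdown-eligible `filters`: required filters must apply regardless of which side they can be pushed into.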

Re: [PR] [HUDI-7840] Add position merging to the new file group reader [hudi]

2024-06-07 Thread via GitHub


yihua merged PR #11413:
URL: https://github.com/apache/hudi/pull/11413





Re: [PR] [HUDI-7840] Add position merging to the new file group reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155701487

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/2bf57a53-50c2-4881-b09e-b9d8025c058a
   





Re: [PR] [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155699424

   
   ## CI report:
   
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   * 5ff06d980691f473a959e149377e7aa14eaf7a55 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24287)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7834] Create placeholder table versions and introduce new hoodie table property to track initial table version [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155695586

   
   ## CI report:
   
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   * 901c7f94b1b56ac19867d5d0deab34eb35ebce2c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24276)
   * 5ff06d980691f473a959e149377e7aa14eaf7a55 UNKNOWN
   
   





Re: [PR] [HUDI-7840] Add position merging to the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155691424

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 0bf72cfded469e3cc1091827cbe4f2f3c16de830 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24285) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24284)
   
   





Re: [PR] [HUDI-7840] Add position merging to the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155661535

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 6c15d7a0558284728296d74c9acbc6805230d9a2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24283)
   * 0bf72cfded469e3cc1091827cbe4f2f3c16de830 UNKNOWN
   
   





Re: [PR] [HUDI-7840] Add position merging to the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155656514

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4d4c0fdc03b72cfb7ad86172a75fcca439e42682 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24281)
   * 6c15d7a0558284728296d74c9acbc6805230d9a2 UNKNOWN
   
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155650961

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 1ce1d753818efb6be00c20fb5a8dd141c7c47f00 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24280)
   * 4d4c0fdc03b72cfb7ad86172a75fcca439e42682 UNKNOWN
   
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631742303


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##########
@@ -116,45 +143,154 @@ class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
                                      skeletonRequiredSchema: Schema,
                                      dataFileIterator: ClosableIterator[InternalRow],
                                      dataRequiredSchema: Schema): ClosableIterator[InternalRow] = {
-    doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-      dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+    doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], skeletonRequiredSchema,
+      dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-    new ClosableIterator[Any] {
-      val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+                               skeletonRequiredSchema: Schema,
+                               dataFileIterator: ClosableIterator[Any],
+                               dataRequiredSchema: Schema): ClosableIterator[InternalRow] = {
+    if (supportsPositionField()) {
+      assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME))
+      assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME))
+      val rowIndexColumn = new java.util.HashSet[String]()
+      rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+      //always remove the row index column from the skeleton because the data file will also have the same column
+      val skeletonProjection = projectRecord(skeletonRequiredSchema,
+        AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, rowIndexColumn))
 
-      override def hasNext: Boolean = {
-        //If the iterators are out of sync it is probably due to filter pushdown
-        checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-          "Bootstrap data-file iterator and skeleton-file iterator have to be in-sync!")
-        dataFileIterator.hasNext && skeletonFileIterator.hasNext
+      //If we need to do position based merging with log files we will leave the row index column at the end
+      val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+        getIdentityProjection
+      } else {
+        projectRecord(dataRequiredSchema,
+          AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, rowIndexColumn))
       }
 
-      override def next(): Any = {
-        (skeletonFileIterator.next(), dataFileIterator.next()) match {
-          case (s: ColumnarBatch, d: ColumnarBatch) =>
-            val numCols = s.numCols() + d.numCols()
-            val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-            for (i <- 0 until numCols) {
-              if (i < s.numCols()) {
-                vecs(i) = s.column(i)
+      //Always use internal row for positional merge because

Review Comment:
   So iterating through the rows are still needed for stitching?  The filtering 
may still happen within the parquet page/batch since the page level filtering 
is based on the column stats, if that is what you're talking about.
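   For context, the position-based merging being discussed keys log-file updates by the base file's row index rather than by record key, which is why the rows still have to be iterated for stitching even when page-level filtering happens. A minimal sketch of the idea (illustrative only, not Hudi's `HoodiePositionBasedFileGroupRecordBuffer` API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PositionalMergeSketch {
    // Merge updates into base-file rows by record position: the parquet
    // row index identifies each base row, so no key comparison is needed.
    static List<String> merge(List<String> baseRows, Map<Long, String> updatesByPosition) {
        List<String> merged = new ArrayList<>(baseRows.size());
        for (long pos = 0; pos < baseRows.size(); pos++) {
            merged.add(updatesByPosition.getOrDefault(pos, baseRows.get((int) pos)));
        }
        return merged;
    }
}
```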






[jira] [Updated] (HUDI-7693) Allow Vectorized Reading for bootstrap in the new fg reader under some conditions

2024-06-07 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7693:
--
Description: Vectorized reading can be used for bootstrap if we don't need 
to do any merging. Additionally, it can be used if no filters are pushed down. 
With row index positions, some pushdown filtering could even be allowed  (was: 
Vectorized reading can be used for bootstrap if we don't need to do any 
merging. Additionally, it can be used if no filters are pushed down.)

> Allow Vectorized Reading for bootstrap in the new fg reader under some 
> conditions
> -
>
> Key: HUDI-7693
> URL: https://issues.apache.org/jira/browse/HUDI-7693
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Priority: Minor
>
> Vectorized reading can be used for bootstrap if we don't need to do any 
> merging. Additionally, it can be used if no filters are pushed down. With row 
> index positions, some pushdown filtering could even be allowed
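The conditions in the description above can be sketched as a single gate. This is an illustrative sketch only; the predicate names are assumptions, not Hudi's API:

```java
public class VectorizedReadGate {
    // Columnar (vectorized) reading is only safe when no per-row work is
    // required, or when row-index positions can restore alignment after
    // pushdown filtering has dropped rows from one side.
    static boolean canUseVectorizedRead(boolean needsMerging,
                                        boolean hasPushedDownFilters,
                                        boolean hasRowIndexPositions) {
        if (needsMerging) {
            return false;
        }
        return !hasPushedDownFilters || hasRowIndexPositions;
    }
}
```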





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631738403


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##########
@@ -116,45 +143,154 @@ class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
                                      skeletonRequiredSchema: Schema,
                                      dataFileIterator: ClosableIterator[InternalRow],
                                      dataRequiredSchema: Schema): ClosableIterator[InternalRow] = {
-    doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-      dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+    doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], skeletonRequiredSchema,
+      dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-    new ClosableIterator[Any] {
-      val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+                               skeletonRequiredSchema: Schema,
+                               dataFileIterator: ClosableIterator[Any],
+                               dataRequiredSchema: Schema): ClosableIterator[InternalRow] = {
+    if (supportsPositionField()) {
+      assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME))
+      assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME))
+      val rowIndexColumn = new java.util.HashSet[String]()
+      rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+      //always remove the row index column from the skeleton because the data file will also have the same column
+      val skeletonProjection = projectRecord(skeletonRequiredSchema,
+        AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, rowIndexColumn))
 
-      override def hasNext: Boolean = {
-        //If the iterators are out of sync it is probably due to filter pushdown
-        checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-          "Bootstrap data-file iterator and skeleton-file iterator have to be in-sync!")
-        dataFileIterator.hasNext && skeletonFileIterator.hasNext
+      //If we need to do position based merging with log files we will leave the row index column at the end
+      val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+        getIdentityProjection
+      } else {
+        projectRecord(dataRequiredSchema,
+          AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, rowIndexColumn))
       }
 
-      override def next(): Any = {
-        (skeletonFileIterator.next(), dataFileIterator.next()) match {
-          case (s: ColumnarBatch, d: ColumnarBatch) =>
-            val numCols = s.numCols() + d.numCols()
-            val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-            for (i <- 0 until numCols) {
-              if (i < s.numCols()) {
-                vecs(i) = s.column(i)
+      //Always use internal row for positional merge because

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-7693






Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


yihua commented on code in PR #11406:
URL: https://github.com/apache/hudi/pull/11406#discussion_r1631737898


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToSevenUpgradeHandler.java:
##########
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Version 7 is going to be placeholder version for bridge release 0.16.0.
+ * Version 8 is the placeholder version to track 1.x.
+ */
+public class SixToSevenUpgradeHandler implements UpgradeHandler {
+  @Override
+  public Map<ConfigProperty, String> upgrade(HoodieWriteConfig config, HoodieEngineContext context,
+                                             String instantTime,
+                                             SupportsUpgradeDowngrade upgradeDowngradeHelper) {
+    return Collections.emptyMap();
+  }

Review Comment:
   Makes sense.  Sounds good to me.






Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631737729


##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestHoodiePositionBasedFileGroupRecordBuffer.java:
##########
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.hudi.common.config.HoodieStorageConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.table.read.HoodiePositionBasedFileGroupRecordBuffer;
+import org.apache.hudi.common.table.read.HoodiePositionBasedSchemaHandler;
+import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderOnSpark;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.SchemaTestUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieValidationException;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.engine.HoodieReaderContext.INTERNAL_META_RECORD_KEY;
+import static org.apache.hudi.common.model.WriteOperationType.INSERT;
+import static org.apache.hudi.common.testutils.HoodieTestUtils.createMetaClient;
+import static org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestHoodiePositionBasedFileGroupRecordBuffer extends TestHoodieFileGroupReaderOnSpark {
+  private final HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEEF);
+  private HoodieTableMetaClient metaClient;
+  private Schema avroSchema;
+  private HoodiePositionBasedFileGroupRecordBuffer buffer;
+  private String partitionPath;
+
+  public void prepareBuffer(boolean useCustomMerger) throws Exception {
+    Map<String, String> writeConfigs = new HashMap<>();
+    writeConfigs.put(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key(), "parquet");
+    writeConfigs.put(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "_row_key");
+    writeConfigs.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "partition_path");
+    writeConfigs.put("hoodie.datasource.write.precombine.field", "timestamp");
+    writeConfigs.put("hoodie.payload.ordering.field", "timestamp");
+    writeConfigs.put(HoodieTableConfig.HOODIE_TABLE_NAME_KEY, "hoodie_test");
+    writeConfigs.put("hoodie.insert.shuffle.parallelism", "4");
+    writeConfigs.put("hoodie.upsert.shuffle.parallelism", "4");
+    writeConfigs.put("hoodie.bulkinsert.shuffle.parallelism", "2");
+    writeConfigs.put("hoodie.delete.shuffle.parallelism", "1");
+    writeConfigs.put("hoodie.merge.small.file.group.candidates.limit", "0");
+    writeConfigs.put("hoodie.compact.inline", "false");
+

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631737122


##########
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##########
@@ -301,9 +311,19 @@ public final UnaryOperator<T> projectRecord(Schema from, Schema to) {
    * @return the record position in the base file.
    */
   public long extractRecordPosition(T record, Schema schema, String fieldName, long providedPositionIfNeeded) {
+    if (supportsParquetRowIndex()) {
+      Object position = getValue(record, schema, fieldName);
+      if (position != null) {
+        return (long) position;
+      }

Review Comment:
   sure
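   The fallback being agreed to above can be sketched as follows. This is a simplified, engine-agnostic sketch (a plain map stands in for the engine record type), assuming the position column holds a boxed `Long` when the reader populated it:

```java
import java.util.Map;

public class RecordPositionSketch {
    // Prefer the row-index column the parquet reader populated; fall back to
    // the externally tracked position when the column is absent or null.
    static long extractPosition(Map<String, ?> record, String fieldName, long providedPositionIfNeeded) {
        Object position = record.get(fieldName);
        if (position instanceof Long) {
            return (Long) position;
        }
        return providedPositionIfNeeded;
    }
}
```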
   






Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631729256


##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestHoodiePositionBasedFileGroupRecordBuffer.java:
##########
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi;
+
+import org.apache.hudi.common.config.HoodieStorageConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.table.read.HoodiePositionBasedFileGroupRecordBuffer;
+import org.apache.hudi.common.table.read.HoodiePositionBasedSchemaHandler;
+import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderOnSpark;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.SchemaTestUtil;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieValidationException;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.engine.HoodieReaderContext.INTERNAL_META_RECORD_KEY;
+import static org.apache.hudi.common.model.WriteOperationType.INSERT;
+import static org.apache.hudi.common.testutils.HoodieTestUtils.createMetaClient;
+import static org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestHoodiePositionBasedFileGroupRecordBuffer extends TestHoodieFileGroupReaderOnSpark {
+  private final HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEEF);
+  private HoodieTableMetaClient metaClient;
+  private Schema avroSchema;
+  private HoodiePositionBasedFileGroupRecordBuffer buffer;
+  private String partitionPath;
+
+  public void prepareBuffer(boolean useCustomMerger) throws Exception {
+    Map<String, String> writeConfigs = new HashMap<>();
+    writeConfigs.put(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key(), "parquet");
+    writeConfigs.put(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "_row_key");
+    writeConfigs.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "partition_path");
+    writeConfigs.put("hoodie.datasource.write.precombine.field", "timestamp");
+    writeConfigs.put("hoodie.payload.ordering.field", "timestamp");
+    writeConfigs.put(HoodieTableConfig.HOODIE_TABLE_NAME_KEY, "hoodie_test");
+    writeConfigs.put("hoodie.insert.shuffle.parallelism", "4");
+    writeConfigs.put("hoodie.upsert.shuffle.parallelism", "4");
+    writeConfigs.put("hoodie.bulkinsert.shuffle.parallelism", "2");
+    writeConfigs.put("hoodie.delete.shuffle.parallelism", "1");
+    writeConfigs.put("hoodie.merge.small.file.group.candidates.limit", "0");
+    writeConfigs.put("hoodie.compact.inline", "false");
+

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-215555

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * d00f2862fb8dd8a84fcc5aa1900e76577b8a9bf1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24275)
   * 1ce1d753818efb6be00c20fb5a8dd141c7c47f00 UNKNOWN
   
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631723794


##########
hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/spark/execution/datasources/parquet/TestSparkFileFormatInternalRowReaderContext.scala:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.execution.datasources.parquet
+
+import org.apache.hudi.SparkFileFormatInternalRowReaderContext
+import 
org.apache.hudi.SparkFileFormatInternalRowReaderContext.filterIsSafeForBootstrap
+import org.apache.hudi.common.model.HoodieRecord
+import 
org.apache.hudi.common.table.read.HoodiePositionBasedFileGroupRecordBuffer.ROW_INDEX_TEMPORARY_COLUMN_NAME
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness
+import org.apache.spark.sql.sources.{And, IsNotNull, Or}
+import org.apache.spark.sql.types.{LongType, StringType, StructField, 
StructType}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
+import org.junit.jupiter.api.Test
+
+class TestSparkFileFormatInternalRowReaderContext extends 
SparkClientFunctionalTestHarness {
+
+  @Test
+  def testBootstrapFilters(): Unit = {
+val recordKeyField = 
HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName
+val commitTimeField = 
HoodieRecord.HoodieMetadataField.COMMIT_TIME_METADATA_FIELD.getFieldName
+
+val recordKeyFilter = IsNotNull(recordKeyField)
+assertTrue(filterIsSafeForBootstrap(recordKeyFilter))
+val commitTimeFilter = IsNotNull(commitTimeField)
+assertTrue(filterIsSafeForBootstrap(commitTimeFilter))
+
+val dataFieldFilter = IsNotNull("someotherfield")
+assertTrue(filterIsSafeForBootstrap(dataFieldFilter))
+
+val legalComplexFilter = Or(recordKeyFilter, commitTimeFilter)
+assertTrue(filterIsSafeForBootstrap(legalComplexFilter))
+
+val illegalComplexFilter = Or(recordKeyFilter, dataFieldFilter)
+assertFalse(filterIsSafeForBootstrap(illegalComplexFilter))
+
+val illegalNestedFilter = And(legalComplexFilter, illegalComplexFilter)
+assertFalse(filterIsSafeForBootstrap(illegalNestedFilter))
+
+val legalNestedFilter = And(legalComplexFilter, recordKeyFilter)
+assertTrue(filterIsSafeForBootstrap(legalNestedFilter))
+  }
+
+  @Test
+  def testGetAppliedRequiredSchema(): Unit = {
+val fields = Array(
+  StructField("column_a", LongType, nullable = false),
+  StructField("column_b", StringType, nullable = false))
+val requiredSchema = StructType(fields)
+
+val appliedSchema: StructType = 
SparkFileFormatInternalRowReaderContext.getAppliedRequiredSchema(

Review Comment:
   TestFiltersInFileGroupReader contains tests that ensure filters are pushed 
down when they should be. I also set breakpoints to confirm the filtering was 
actually happening.
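For context on the filter-safety check these tests exercise: a filter is safe to push into one side of a bootstrap read only if every column it references lives entirely on one side of the split (metadata columns in the skeleton file, or data columns in the data file). The following is a simplified standalone sketch of that idea; the class name, the meta-field list, and the column-list representation are assumptions for illustration and do not mirror Hudi's actual `filterIsSafeForBootstrap` implementation, which operates on Spark `Filter` trees.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration of a bootstrap filter-safety check.
// A filter is represented here only by the column names it references.
public class BootstrapFilterSafety {
  // Hudi metadata columns stored in the skeleton file (subset, for illustration).
  static final Set<String> META_FIELDS = new HashSet<>(Arrays.asList(
      "_hoodie_record_key", "_hoodie_commit_time"));

  // Safe if all referenced columns come from one side of the bootstrap split:
  // either all metadata columns (skeleton file) or all data columns (data file).
  static boolean isSafeForBootstrap(List<String> referencedColumns) {
    boolean anyMeta = referencedColumns.stream().anyMatch(META_FIELDS::contains);
    boolean anyData = referencedColumns.stream().anyMatch(c -> !META_FIELDS.contains(c));
    return !(anyMeta && anyData);
  }

  public static void main(String[] args) {
    System.out.println(isSafeForBootstrap(Arrays.asList("_hoodie_record_key")));  // true
    System.out.println(isSafeForBootstrap(Arrays.asList("someotherfield")));      // true
    System.out.println(isSafeForBootstrap(
        Arrays.asList("_hoodie_record_key", "someotherfield")));                  // false
  }
}
```

This mirrors the test cases above: meta-only and data-only filters pass, while a filter mixing both sides is rejected because it cannot be evaluated against either file alone.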






Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631715014


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodiePositionBasedSchemaHandler.java:
##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.internal.schema.InternalSchema;
+
+import org.apache.avro.Schema;
+
+import java.util.Collections;
+import java.util.List;
+
+import static 
org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested;
+
+/**
+ * This class is responsible for handling the schema for the file group reader 
that supports positional merge.
+ */
+public class HoodiePositionBasedSchemaHandler extends 
HoodieFileGroupReaderSchemaHandler {
+  public HoodiePositionBasedSchemaHandler(HoodieReaderContext readerContext,
+  Schema dataSchema,
+  Schema requestedSchema,
+  Option 
internalSchemaOpt,
+  HoodieTableConfig hoodieTableConfig) 
{
+super(readerContext, dataSchema, requestedSchema, internalSchemaOpt, 
hoodieTableConfig);
+

Review Comment:
   nit: remove empty line.



##
hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/spark/execution/datasources/parquet/TestSparkFileFormatInternalRowReaderContext.scala:
##
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.execution.datasources.parquet
+
+import org.apache.hudi.SparkFileFormatInternalRowReaderContext
+import 
org.apache.hudi.SparkFileFormatInternalRowReaderContext.filterIsSafeForBootstrap
+import org.apache.hudi.common.model.HoodieRecord
+import 
org.apache.hudi.common.table.read.HoodiePositionBasedFileGroupRecordBuffer.ROW_INDEX_TEMPORARY_COLUMN_NAME
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness
+import org.apache.spark.sql.sources.{And, IsNotNull, Or}
+import org.apache.spark.sql.types.{LongType, StringType, StructField, 
StructType}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
+import org.junit.jupiter.api.Test
+
+class TestSparkFileFormatInternalRowReaderContext extends 
SparkClientFunctionalTestHarness {
+
+  @Test
+  def testBootstrapFilters(): Unit = {
+val recordKeyField = 
HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName
+val commitTimeField = 
HoodieRecord.HoodieMetadataField.COMMIT_TIME_METADATA_FIELD.getFieldName
+
+val recordKeyFilter = IsNotNull(recordKeyField)
+assertTrue(filterIsSafeForBootstrap(recordKeyFilter))
+val commitTimeFilter = IsNotNull(commitTimeField)
+assertTrue(filterIsSafeForBootstrap(commitTimeFilter))
+
+val dataFieldFilter = IsNotNull("someotherfield")
+assertTrue(filterIsSafeForBootstrap(dataFieldFilter))
+
+val legalComplexFilter = Or(recordKeyFilter, commitTimeFilter)
+assertTrue(filterIsSafeForBootstrap(legalComplexFilter))
+
+val illegalComplexFilter = Or(recordKeyFilter, dataFieldFilter)
+assertFalse(filterIsSafeForBootstrap(illegalComplexFilter))
+
+val 

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631715749


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+   skeletonRequiredSchema: Schema,
+   dataFileIterator: ClosableIterator[Any],
+   dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (supportsPositionField()) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val rowIndexColumn = new java.util.HashSet[String]()
+  rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  //always remove the row index column from the skeleton because the data 
file will also have the same column
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
rowIndexColumn))
 
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  //If we need to do position based merging with log files we will leave 
the row index column at the end
+  val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, 
rowIndexColumn))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false
+} else {
+  nextSkeleton = getNextSkeleton
+}
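The position-alignment loop in the quoted `doHasNext` can be sketched independently of Spark types: two iterators, each yielding rows tagged with a monotonically increasing position, are advanced past positions that exist on only one side (e.g. rows dropped by filter pushdown) until their positions match, at which point the rows are joined. Below is a minimal standalone sketch; the `Row` class and the string-join output are assumptions for illustration, whereas the real code joins Spark `InternalRow` instances.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of position-based merging of a skeleton iterator and a data
// iterator. Positions are assumed to be strictly increasing on each side.
public class PositionMerge {
  // A row tagged with its position in the file (the last "column" in the real code).
  static class Row {
    final long pos;
    final String value;
    Row(long pos, String value) { this.pos = pos; this.value = value; }
  }

  // Advance both iterators until positions match; emit joined values.
  // Rows whose position has no counterpart on the other side are skipped.
  static List<String> merge(Iterator<Row> skeleton, Iterator<Row> data) {
    List<String> out = new ArrayList<>();
    if (!skeleton.hasNext() || !data.hasNext()) {
      return out;
    }
    Row s = skeleton.next();
    Row d = data.next();
    while (true) {
      if (s.pos == d.pos) {
        out.add(s.value + "|" + d.value);
        if (!skeleton.hasNext() || !data.hasNext()) {
          return out;
        }
        s = skeleton.next();
        d = data.next();
      } else if (s.pos > d.pos) {
        // Data side is behind; skip data rows with no skeleton counterpart.
        if (!data.hasNext()) {
          return out;
        }
        d = data.next();
      } else {
        // Skeleton side is behind; skip skeleton rows with no data counterpart.
        if (!skeleton.hasNext()) {
          return out;
        }
        s = skeleton.next();
      }
    }
  }
}
```

For example, merging skeleton positions {0, 2, 3} with data positions {0, 1, 3} yields joined rows only at positions 0 and 3, matching the skip behavior of the quoted loop.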
 

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631715307


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+   skeletonRequiredSchema: Schema,
+   dataFileIterator: ClosableIterator[Any],
+   dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (supportsPositionField()) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val rowIndexColumn = new java.util.HashSet[String]()
+  rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  //always remove the row index column from the skeleton because the data 
file will also have the same column
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
rowIndexColumn))
 
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  //If we need to do position based merging with log files we will leave 
the row index column at the end
+  val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, 
rowIndexColumn))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because

Review Comment:
   I think the filtering is actually done per batch as well, so we wouldn't 
need to iterate through the rows themselves






Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631711150


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+   skeletonRequiredSchema: Schema,
+   dataFileIterator: ClosableIterator[Any],
+   dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (supportsPositionField()) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val rowIndexColumn = new java.util.HashSet[String]()
+  rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  //always remove the row index column from the skeleton because the data 
file will also have the same column
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
rowIndexColumn))
 
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  //If we need to do position based merging with log files we will leave 
the row index column at the end
+  val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, 
rowIndexColumn))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because

Review Comment:
   We can still iterate through rows within the `ColumnarBatch` during 
vectorized processing. We can leave that as a follow-up.



##
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##
@@ -122,6 +123,15 @@ public void setNeedsBootstrapMerge(boolean 
needsBootstrapMerge) {
 this.needsBootstrapMerge = needsBootstrapMerge;
   }
 
+  // Getter and Setter for useRecordPosition
+  public boolean getUseRecordPosition() {
+return useRecordPosition;
+  }

Review Comment:
   Rename the getter and setter to something like `shouldMergeUseRecordPosition` 
and `setMergeUseRecordPosition` so it indicates this is used for controlling the 
merging behavior.



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: 

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631710563


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+   skeletonRequiredSchema: Schema,
+   dataFileIterator: ClosableIterator[Any],
+   dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (supportsPositionField()) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val rowIndexColumn = new java.util.HashSet[String]()
+  rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  //always remove the row index column from the skeleton because the data 
file will also have the same column
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
rowIndexColumn))
 
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  //If we need to do position based merging with log files we will leave 
the row index column at the end
+  val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, 
rowIndexColumn))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false
+} else {
+  nextSkeleton = getNextSkeleton
+}
 

Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155580839

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 33249cc712c6dcdde12efe8536579d3c9c5f8575 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


yihua commented on code in PR #11413:
URL: https://github.com/apache/hudi/pull/11413#discussion_r1631704549


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -116,45 +143,154 @@ class 
SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
  skeletonRequiredSchema: Schema,
  dataFileIterator: 
ClosableIterator[InternalRow],
  dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
+  private def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+   skeletonRequiredSchema: Schema,
+   dataFileIterator: ClosableIterator[Any],
+   dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (supportsPositionField()) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val rowIndexColumn = new java.util.HashSet[String]()
+  rowIndexColumn.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  //always remove the row index column from the skeleton because the data 
file will also have the same column
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
rowIndexColumn))
 
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  //If we need to do position based merging with log files we will leave 
the row index column at the end
+  val dataProjection = if (getHasLogFiles && getUseRecordPosition) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, 
rowIndexColumn))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false
+} else {
+  nextSkeleton = getNextSkeleton
+}
  

Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155513457

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * b29ff638867f3760156318bb58a7677c67a415dc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24274)
 
   * 33249cc712c6dcdde12efe8536579d3c9c5f8575 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24279)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155505446

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * b29ff638867f3760156318bb58a7677c67a415dc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24274)
 
   * 33249cc712c6dcdde12efe8536579d3c9c5f8575 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155498531

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * d00f2862fb8dd8a84fcc5aa1900e76577b8a9bf1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1631640824


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderRecordReader.java:
##
@@ -0,0 +1,294 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.HoodieCommonConfig;
+import org.apache.hudi.common.config.HoodieReaderConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.BaseFile;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.table.read.HoodieFileGroupReader;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TablePathUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader;
+import org.apache.hudi.hadoop.realtime.RealtimeSplit;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.UnaryOperator;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static org.apache.hudi.common.config.HoodieMemoryConfig.MAX_MEMORY_FOR_MERGE;
+import static org.apache.hudi.common.config.HoodieMemoryConfig.SPILLABLE_MAP_BASE_PATH;
+
+public class HoodieFileGroupReaderRecordReader implements RecordReader<NullWritable, ArrayWritable> {
+
+  public interface HiveReaderCreator {
+    org.apache.hadoop.mapred.RecordReader<NullWritable, ArrayWritable> getRecordReader(
+        final org.apache.hadoop.mapred.InputSplit split,
+        final org.apache.hadoop.mapred.JobConf job,
+        final org.apache.hadoop.mapred.Reporter reporter
+    ) throws IOException;
+  }
+
+  private final HiveHoodieReaderContext readerContext;
+  private final HoodieFileGroupReader<ArrayWritable> fileGroupReader;
+  private final ArrayWritable arrayWritable;
+  private final NullWritable nullWritable = NullWritable.get();
+  private final InputSplit inputSplit;
+  private final JobConf jobConfCopy;
+  private final UnaryOperator<ArrayWritable> reverseProjection;
+
+  public HoodieFileGroupReaderRecordReader(HiveReaderCreator readerCreator,
+                                           final InputSplit split,
+                                           final JobConf jobConf,
+                                           final Reporter reporter) throws IOException {
+    this.jobConfCopy = new JobConf(jobConf);
+    HoodieRealtimeInputFormatUtils.cleanProjectionColumnIds(jobConfCopy);
+    Set<String> partitionColumns = new HashSet<>(getPartitionFieldNames(jobConfCopy));
+    this.inputSplit = split;
+
+    FileSplit fileSplit = (FileSplit) split;
+    String tableBasePath = getTableBasePath(split,

Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1631634950


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableCompaction.java:
##
@@ -146,43 +147,50 @@ public void testWriteDuringCompaction(String 
payloadClass) throws IOException {
   @ParameterizedTest
   @MethodSource("writeLogTest")
   public void testWriteLogDuringCompaction(boolean enableMetadataTable, 
boolean enableTimelineServer) throws IOException {
-    Properties props = getPropertiesForKeyGen(true);
-    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
-        .forTable("test-trip-table")
-        .withPath(basePath())
-        .withSchema(TRIP_EXAMPLE_SCHEMA)
-        .withParallelism(2, 2)
-        .withAutoCommit(true)
-        .withEmbeddedTimelineServerEnabled(enableTimelineServer)
-        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(enableMetadataTable).build())
-        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
-            .withMaxNumDeltaCommitsBeforeCompaction(1).build())
-        .withLayoutConfig(HoodieLayoutConfig.newBuilder()
-            .withLayoutType(HoodieStorageLayout.LayoutType.BUCKET.name())
-            .withLayoutPartitioner(SparkBucketIndexPartitioner.class.getName()).build())
-        .withIndexConfig(HoodieIndexConfig.newBuilder().fromProperties(props).withIndexType(HoodieIndex.IndexType.BUCKET).withBucketNum("1").build())
-        .build();
-    props.putAll(config.getProps());
-
-    metaClient = getHoodieMetaClient(HoodieTableType.MERGE_ON_READ, props);
-    client = getHoodieWriteClient(config);
-
-    final List<HoodieRecord> records = dataGen.generateInserts("001", 100);
-    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 2);
+    try {
+      //disable for this test because it seems like we process mor in a different order?

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-7610 Delete behavior is 
inconsistent and, in my opinion, undefined. One of the advantages of unifying 
all the readers with the FGReader is that we can remove the inconsistency 
between engines. 






Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11415:
URL: https://github.com/apache/hudi/pull/11415#issuecomment-2155410555

   
   ## CI report:
   
   * 644a1d216307d8660ff7654c5273f2356974bcb8 UNKNOWN
   * bfea0d3a2dd9e6ba2d96c1d7d20a07e085883da6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11415:
URL: https://github.com/apache/hudi/pull/11415#issuecomment-2155393640

   
   ## CI report:
   
   * 644a1d216307d8660ff7654c5273f2356974bcb8 UNKNOWN
   * 40932069f637e82d80731fe8625331d293fdc1e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24277)
 
   * bfea0d3a2dd9e6ba2d96c1d7d20a07e085883da6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155393533

   
   ## CI report:
   
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   * 901c7f94b1b56ac19867d5d0deab34eb35ebce2c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24276)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155337800

   
   ## CI report:
   
   * e8a80e29e51c84a3403906d2acf0aeee24dedda4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24245)
 
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   * 901c7f94b1b56ac19867d5d0deab34eb35ebce2c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11415:
URL: https://github.com/apache/hudi/pull/11415#issuecomment-2155337978

   
   ## CI report:
   
   * 644a1d216307d8660ff7654c5273f2356974bcb8 UNKNOWN
   * 40932069f637e82d80731fe8625331d293fdc1e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155337873

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4e6335c7cfb18881776d572954558a41aa33b91d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24273)
 
   * d00f2862fb8dd8a84fcc5aa1900e76577b8a9bf1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11415:
URL: https://github.com/apache/hudi/pull/11415#issuecomment-2155328233

   
   ## CI report:
   
   * 644a1d216307d8660ff7654c5273f2356974bcb8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155328163

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4e6335c7cfb18881776d572954558a41aa33b91d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24273)
 
   * d00f2862fb8dd8a84fcc5aa1900e76577b8a9bf1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11406:
URL: https://github.com/apache/hudi/pull/11406#issuecomment-2155328082

   
   ## CI report:
   
   * e8a80e29e51c84a3403906d2acf0aeee24dedda4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24245)
 
   * da08a0b3c0524b46e70a4cbed8ab82eb5f84f24c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


balaji-varadarajan commented on code in PR #11406:
URL: https://github.com/apache/hudi/pull/11406#discussion_r1631564698


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -623,6 +624,8 @@ private static void initTableMetaClient(StorageConfiguration storageConf, Str
     }
 
     initializeBootstrapDirsIfNotExists(basePath, storage);
+    // When the table is initialized, set the initial version to be the current version.
+    props.put(INITIAL_VERSION.key(), String.valueOf(HoodieTableVersion.current().versionCode()));

Review Comment:
   Good point. Found one place in RepairsCommand and added the fix. 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToSevenUpgradeHandler.java:
##
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Version 7 is going to be a placeholder version for the bridge release 0.16.0.
+ * Version 8 is the placeholder version to track 1.x.

Review Comment:
   Done



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToSixDowngradeHandler.java:
##
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Version 7 is going to be a placeholder version for the bridge release 0.16.0.
+ * Version 8 is the placeholder version to track 1.x.

Review Comment:
   Done






Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


balaji-varadarajan commented on code in PR #11406:
URL: https://github.com/apache/hudi/pull/11406#discussion_r1631549810


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToSevenUpgradeHandler.java:
##
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Version 7 is going to be a placeholder version for the bridge release 0.16.0.
+ * Version 8 is the placeholder version to track 1.x.
+ */
+public class SixToSevenUpgradeHandler implements UpgradeHandler {
+  @Override
+  public Map<ConfigProperty, String> upgrade(HoodieWriteConfig config, HoodieEngineContext context,
+                                             String instantTime,
+                                             SupportsUpgradeDowngrade upgradeDowngradeHelper) {
+    return Collections.emptyMap();
+  }

Review Comment:
   We cannot determine the correct initial version during the upgrade path, as 
we upgrade one version increment at a time. We can interpret the absence of 
INITIAL_VERSION as meaning the table was created by some 0.x version. 
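
A minimal sketch of the fallback interpretation described above. The property key `hoodie.table.initial.version` appears in the PR; the `initialVersionCode` helper and its `-1` sentinel are illustrative only, not the actual Hudi API:

```java
import java.util.Properties;

public class InitialVersionSketch {
    static final String INITIAL_VERSION_KEY = "hoodie.table.initial.version";

    // Returns the recorded initial version code, or -1 to signal
    // "created by some 0.x release before the property existed".
    static int initialVersionCode(Properties tableProps) {
        String v = tableProps.getProperty(INITIAL_VERSION_KEY);
        return v == null ? -1 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Properties legacy = new Properties();         // no property: pre-1.x table
        Properties fresh = new Properties();
        fresh.setProperty(INITIAL_VERSION_KEY, "8");  // table created at version 8
        System.out.println(initialVersionCode(legacy)); // -1
        System.out.println(initialVersionCode(fresh));  // 8
    }
}
```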






[PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


jonvex opened a new pull request, #11415:
URL: https://github.com/apache/hudi/pull/11415

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-06-07 Thread via GitHub


jonvex closed pull request #10991: [HUDI-7269] Fallback to key based merge if 
positions are missing from log block
URL: https://github.com/apache/hudi/pull/10991





Re: [PR] [HUDI-7834] Create placeholder table versions. Introduce new hoodie table property to track initial table version when table was created. This is needed to identify if the table was originall

2024-06-07 Thread via GitHub


balaji-varadarajan commented on code in PR #11406:
URL: https://github.com/apache/hudi/pull/11406#discussion_r1631546670


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -512,6 +519,15 @@ public HoodieTableVersion getTableVersion() {
         : VERSION.defaultValue();
   }
 
+  /**
+   * @return the hoodie.table.initial.version from hoodie.properties file.
+   */
+  public HoodieTableVersion getTableInitialVersion() {
+    return contains(INITIAL_VERSION)
+        ? HoodieTableVersion.versionFromCode(getInt(INITIAL_VERSION))

Review Comment:
   INITIAL_VERSION is similar to VERSION in type. 






[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-07 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7841:
--
Summary: RLI and secondary index should consider only pruned partitions for 
file skipping  (was: RLI should consider only pruned partitions for file 
skipping)

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use the `prunedPartitionsAndFileSlices` to only consider 
> pruned partitions whenever there is a partition predicate.
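
The pruning idea above can be sketched as follows. `prunedPartitionsToFiles` and `candidateFiles` are illustrative names, not the actual `RecordLevelIndexSupport` API: intersect the files matched by the record-level index with the files belonging to partitions that survived partition-predicate pruning, rather than iterating over all files.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class PrunedLookupSketch {
    // Keep only RLI-matched files that live in a still-relevant (pruned-in) partition.
    static Set<String> candidateFiles(Map<String, List<String>> prunedPartitionsToFiles,
                                      Set<String> rliMatchedFiles) {
        return prunedPartitionsToFiles.values().stream()
                .flatMap(List::stream)
                .filter(rliMatchedFiles::contains)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, List<String>> pruned = Map.of("2024/06/07", List.of("f1", "f2"));
        Set<String> matched = Set.of("f2", "f9"); // f9 sits in a pruned-out partition
        System.out.println(candidateFiles(pruned, matched)); // [f2]
    }
}
```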



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7841) RLI should consider only pruned partitions for file skipping

2024-06-07 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7841:
-

Assignee: Lokesh Jain

> RLI should consider only pruned partitions for file skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use the `prunedPartitionsAndFileSlices` to only consider 
> pruned partitions whenever there is a partition predicate.





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155239340

   
   ## CI report:
   
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4e6335c7cfb18881776d572954558a41aa33b91d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24273)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155237282

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * b29ff638867f3760156318bb58a7677c67a415dc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24274)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] Intermittent stall of S3 PUT request for about 17 minutes [hudi]

2024-06-07 Thread via GitHub


gudladona commented on issue #11203:
URL: https://github.com/apache/hudi/issues/11203#issuecomment-2155201212

   Assessment and workaround  provided here: 
https://github.com/aws/aws-sdk-java/issues/3110





[jira] [Commented] (HUDI-4705) Support Write-on-compaction mode when query cdc on MOR tables

2024-06-07 Thread Shiyan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853214#comment-17853214
 ] 

Shiyan Xu commented on HUDI-4705:
-

[~lizhiqiang] [~biyan900...@gmail.com] to clarify, CDC for Spark already works on 
MOR; it is just that the implementation uses the write-on-indexing strategy (ref: 
[https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md#persisting-cdc-in-mor-write-on-indexing-vs-write-on-compaction])

We want to unify the implementation around write-on-compaction, which allows the 
Flink writer to work too (the write-on-indexing strategy does not work for Flink, 
as explained in the RFC).

> Support Write-on-compaction mode when query cdc on MOR tables
> -
>
> Key: HUDI-4705
> URL: https://issues.apache.org/jira/browse/HUDI-4705
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction, spark, table-service
>Reporter: Yann Byron
>Priority: Major
>
> For the case of querying CDC on MOR tables, the initial implementation uses 
> the `Write-on-indexing` way to extract the CDC data by merging the base file 
> and log files in-flight.
> This ticket aims to support the `Write-on-compaction` way to get the CDC 
> data just by reading the persisted CDC files, which are written during the 
> compaction operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
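For context on the strategies discussed above, CDC on a Hudi table is configured at write time and consumed through an incremental query. A minimal sketch of the relevant options (option names follow the Hudi configuration docs; availability and defaults vary by release):

```properties
# Write side: persist change data alongside the table (MOR or COW).
hoodie.table.cdc.enabled=true
# How much of the before/after image to log; richer modes avoid re-reading
# base files when serving CDC queries.
hoodie.table.cdc.supplemental.logging.mode=data_before_after

# Read side: ask the incremental query to return CDC-format results.
hoodie.datasource.query.type=incremental
hoodie.datasource.query.incremental.format=cdc
```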


Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155176602

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 18fbd92eec10c49025db364be79cc9dbfccee362 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24162)
 
   * b29ff638867f3760156318bb58a7677c67a415dc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24274)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2155165419

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 18fbd92eec10c49025db364be79cc9dbfccee362 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24162)
 
   * b29ff638867f3760156318bb58a7677c67a415dc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-06-07 Thread via GitHub


ad1happy2go commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2155139922

   @SuneethaYamani 
https://hudi.apache.org/docs/configurations/#hoodiemetadataenable





Re: [I] [SUPPORT] - Partial update of the MOR table after compaction with Hudi Streamer [hudi]

2024-06-07 Thread via GitHub


ad1happy2go commented on issue #11348:
URL: https://github.com/apache/hudi/issues/11348#issuecomment-2155137100

   @kirillklimenko We will look into it. Thanks for the details.





Re: [I] duplicated records when use insert overwrite [hudi]

2024-06-07 Thread via GitHub


ad1happy2go commented on issue #11358:
URL: https://github.com/apache/hudi/issues/11358#issuecomment-2155135667

   @njalan If the data you are inserting has duplicates, then insert overwrite 
will create duplicates in the table.
   
   Can you please share the timeline with us so we can look further?
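The behavior described above can be sketched generically: `insert overwrite` replaces a partition's files with whatever rows it is given, so de-duplication by record key (keeping the row with the latest ordering value) has to happen upstream of the write. A minimal illustration, with field names that are assumptions rather than Hudi internals:

```python
def dedup_latest(rows, key="record_key", ordering="ts"):
    """Keep only the row with the highest ordering value per record key."""
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only if this one is strictly newer.
        if k not in latest or row[ordering] > latest[k][ordering]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"record_key": "r1", "ts": 1, "val": "old"},
    {"record_key": "r1", "ts": 2, "val": "new"},
    {"record_key": "r2", "ts": 1, "val": "x"},
]
out = dedup_latest(rows)
print(sorted((r["record_key"], r["val"]) for r in out))
# [('r1', 'new'), ('r2', 'x')]
```

If this step is skipped, the overwrite faithfully writes the duplicated input, which matches the reported behavior.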
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155087634

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24264)
 
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4e6335c7cfb18881776d572954558a41aa33b91d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24273)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155074344

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24264)
 
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   * 4e6335c7cfb18881776d572954558a41aa33b91d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2155060689

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24264)
 
   * d581b2726ba5047c9e72396820da81ecf1357266 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154949218

   
   ## CI report:
   
   * c2dec94b442920784b3914cc13b87294e734a477 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24272)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154856344

   
   ## CI report:
   
   * 3e8bdc41e97141b94a9f60a3450f41ad342fa45e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24271)
 
   * c2dec94b442920784b3914cc13b87294e734a477 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24272)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154841313

   
   ## CI report:
   
   * 3e8bdc41e97141b94a9f60a3450f41ad342fa45e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24271)
 
   * c2dec94b442920784b3914cc13b87294e734a477 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154824717

   
   ## CI report:
   
   * 3e8bdc41e97141b94a9f60a3450f41ad342fa45e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24271)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-7841) RLI should consider only pruned partitions for file skipping

2024-06-07 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7841:
-

 Summary: RLI should consider only pruned partitions for file skipping
 Key: HUDI-7841
 URL: https://issues.apache.org/jira/browse/HUDI-7841
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.0.0


Even though RLI scans only matching files, it tries to get those candidate 
files by iterating over all files from the file index. See 
[https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]

Instead, it can use `prunedPartitionsAndFileSlices` to consider only pruned 
partitions whenever there is a partition predicate.



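The proposed improvement can be sketched outside of Hudi: restrict the candidate-file iteration to partitions that survived partition pruning, falling back to all partitions when there is no partition predicate. The function and data shapes below are illustrative assumptions, not the actual `RecordLevelIndexSupport` API:

```python
def candidate_files(files_by_partition, pruned_partitions=None):
    """Return candidate files for RLI lookup, limited to pruned partitions.

    files_by_partition: dict mapping partition path -> list of file names.
    pruned_partitions: set of partition paths surviving partition pruning,
    or None when no partition predicate exists (consider all partitions).
    """
    partitions = pruned_partitions or files_by_partition.keys()
    return [f for p in partitions for f in files_by_partition.get(p, [])]

files = {
    "2024/06/01": ["a.parquet", "b.parquet"],
    "2024/06/02": ["c.parquet"],
}
print(candidate_files(files, {"2024/06/02"}))  # ['c.parquet']
print(len(candidate_files(files)))            # 3 (no predicate -> all files)
```

With a partition predicate in place, only one partition's files are iterated instead of the whole file index.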


Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


codope commented on code in PR #9894:
URL: https://github.com/apache/hudi/pull/9894#discussion_r1631121419


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -1382,5 +1398,35 @@ public HoodieTableMetaClient initTable(StorageConfiguration configuration, St
      throws IOException {
    return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
  }
+
+  private void validateMergeConfigs() {

Review Comment:
   Where is this method used? 



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -242,6 +249,11 @@ public HoodieFileGroupReaderIterator getClosableIterator() {
    return new HoodieFileGroupReaderIterator<>(this);
  }
 
+  public static RecordMergeMode getRecordMergeMode(Properties props) {
+    String mergeMode = getStringWithAltKeys(props, HoodieCommonConfig.RECORD_MERGE_MODE, true).toUpperCase();

Review Comment:
   note: Setting `useDefaultValue` to true as many tests don't set record merge 
mode.






Re: [PR] [HUDI-7390] fix: HoodieStreamer no longer works without --props being supplied [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11414:
URL: https://github.com/apache/hudi/pull/11414#issuecomment-2154749790

   
   ## CI report:
   
   * 3ffd431d11a16bfb032e905eceb5374d901cb6ee Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154747202

   
   ## CI report:
   
   * 083ea7ec0e0cb2f14fc47faff5d781a64cca3874 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24269)
 
   * 3e8bdc41e97141b94a9f60a3450f41ad342fa45e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24271)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154733863

   
   ## CI report:
   
   * 083ea7ec0e0cb2f14fc47faff5d781a64cca3874 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24269)
 
   * 3e8bdc41e97141b94a9f60a3450f41ad342fa45e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154721676

   
   ## CI report:
   
   * 083ea7ec0e0cb2f14fc47faff5d781a64cca3874 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24269)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7390] fix: HoodieStreamer no longer works without --props being supplied [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11414:
URL: https://github.com/apache/hudi/pull/11414#issuecomment-2154656824

   
   ## CI report:
   
   * 3ffd431d11a16bfb032e905eceb5374d901cb6ee Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154654455

   
   ## CI report:
   
   * f1ad4786aad397d5bad19d3cf68cbbb90c92d9ac Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24267)
 
   * 083ea7ec0e0cb2f14fc47faff5d781a64cca3874 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24269)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7390] fix: HoodieStreamer no longer works without --props being supplied [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #11414:
URL: https://github.com/apache/hudi/pull/11414#issuecomment-2154644253

   
   ## CI report:
   
   * 3ffd431d11a16bfb032e905eceb5374d901cb6ee UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2024-06-07 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-2154641685

   
   ## CI report:
   
   * f1ad4786aad397d5bad19d3cf68cbbb90c92d9ac Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24267)
 
   * 083ea7ec0e0cb2f14fc47faff5d781a64cca3874 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   




