Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]

2024-06-05 Thread via GitHub


prathit06 commented on issue #11397:
URL: https://github.com/apache/hudi/issues/11397#issuecomment-2151534901

   @danny0405 Please review: https://github.com/apache/hudi/pull/11404
   Also, could you please create a Jira for this so I can add it to the PR? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Fix missing serDe properties post migration from hiveSync to glueSync [hudi]

2024-06-05 Thread via GitHub


prathit06 opened a new pull request, #11404:
URL: https://github.com/apache/hudi/pull/11404

   ### Change Logs
   
   Add serDe properties to the table DDL if they are missing after migrating from 
hive sync to glue sync.
   More context: https://github.com/apache/hudi/issues/11397
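   As a hedged illustration of the problem (not code from this PR): a migrated Glue 
table can end up with a storage descriptor that lacks the serde properties hive sync 
normally sets. A manual repair via Spark SQL might look like the sketch below, with a 
hypothetical table name, path, and property values; the PR automates the equivalent 
inside glue sync.

```scala
// Hedged sketch, not the PR's code: re-attach serde properties that hive sync
// would normally set. The table name, path, and values are illustrative.
spark.sql(
  """ALTER TABLE my_db.my_hudi_table
    |SET SERDEPROPERTIES (
    |  'path' = 's3://my-bucket/warehouse/my_hudi_table',
    |  'serialization.format' = '1'
    |)""".stripMargin)
```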
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._ : NA
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._ : NA
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._ : NA
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]

2024-06-05 Thread via GitHub


ad1happy2go commented on issue #11391:
URL: https://github.com/apache/hudi/issues/11391#issuecomment-2151489857

   @soumilshah1995 Looks like the AWS SDK bundle version conflicts with 
hudi-aws-bundle.
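   For reference, a minimal sketch of the DynamoDB lock configuration this issue 
exercises, with placeholder table, key, and region values; the usual remedy is to let 
hudi-aws-bundle supply the AWS SDK classes rather than adding a separate SDK bundle 
jar to the classpath.

```scala
// Illustrative writer options for DynamoDB-based locking (values are placeholders).
// A second AWS SDK bundle jar next to hudi-aws-bundle can cause class conflicts
// like the one reported above, so keep only one source of SDK classes.
val lockOpts = Map(
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  "hoodie.write.lock.provider" -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table" -> "hudi_locks",
  "hoodie.write.lock.dynamodb.partition_key" -> "my_table",
  "hoodie.write.lock.dynamodb.region" -> "us-east-1"
)
```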





[jira] [Created] (HUDI-7834) Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions

2024-06-05 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-7834:


 Summary: Setup table versions to differentiate HUDI 0.16.x and 
1.0-beta versions
 Key: HUDI-7834
 URL: https://issues.apache.org/jira/browse/HUDI-7834
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Balaji Varadarajan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7834) Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions

2024-06-05 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-7834:


Assignee: Balaji Varadarajan

> Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions
> ---
>
> Key: HUDI-7834
> URL: https://issues.apache.org/jira/browse/HUDI-7834
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] It failed to compile raw hudi src with error "HoodieTableMetadataUtil.java:[189,7] no suitable method found for collect(java.util.stream.Collector

2024-06-05 Thread via GitHub


danny0405 commented on issue #5552:
URL: https://github.com/apache/hudi/issues/5552#issuecomment-2151362100

   We did have a fix for Windows OS paths with special backslashes. Do you 
still encounter any issues compiling on Windows OS?





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151300459

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 1e677e9b8b5d79cb23e85f2577407f9be840c762 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24242)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]

2024-06-05 Thread via GitHub


danny0405 commented on issue #11397:
URL: https://github.com/apache/hudi/issues/11397#issuecomment-2151283036

   > I have fixed this for our internal use & would like to contribute the same
   
   That's great. Can you share the patch with us?





Re: [I] [SUPPORT] [hudi]

2024-06-05 Thread via GitHub


danny0405 commented on issue #11403:
URL: https://github.com/apache/hudi/issues/11403#issuecomment-2151279640

   I would suggest you use 0.12.3 or 0.14.1; 0.12.1 still has some 
stability issues.





[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6787:

Reviewers: Balaji Varadarajan

> Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> --
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Closed] (HUDI-7384) Implement writer path support for secondary index

2024-06-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7384.
-
Fix Version/s: 1.0.0
   Resolution: Done

> Implement writer path support for secondary index
> -
>
> Key: HUDI-7384
> URL: https://issues.apache.org/jira/browse/HUDI-7384
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> # Basic initialization on an existing table
>  # Handle inserts/upserts





[jira] [Updated] (HUDI-7405) Implement reader path support for secondary index

2024-06-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7405:
--
Status: In Progress  (was: Open)

> Implement reader path support for secondary index
> -
>
> Key: HUDI-7405
> URL: https://issues.apache.org/jira/browse/HUDI-7405
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>






[jira] [Closed] (HUDI-7795) Fix loading of input splits from look up table reader

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7795.
---
Resolution: Fixed

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7405) Implement reader path support for secondary index

2024-06-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7405:
--
Status: Patch Available  (was: In Progress)

> Implement reader path support for secondary index
> -
>
> Key: HUDI-7405
> URL: https://issues.apache.org/jira/browse/HUDI-7405
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>






[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Fix Version/s: 0.16.0
   1.0.0

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline could lead to data consistency issues
> on rare occasions. We should come up with proper guards to ensure we do not
> perform such unintended archival.
>
> The major gap we want to guard against:
> if someone disables the cleaner, archival should account for data consistency
> issues and bail out.
> We have a base guarding condition, where archival will stop at the earliest
> commit to retain based on the latest clean commit metadata. But there are a few
> other scenarios that need to be accounted for.
>
> a. Keeping replace commits aside, let's dive into specifics for regular
> commits and delta commits.
> Say the user configured cleaning to retain 4 commits and archival configs to 5
> and 6. After t10, the cleaner is supposed to clean up all file versions created
> at or before t6. Say the cleaner did not run (for whatever reason) for the next
> 5 commits.
>     Archival will certainly be guarded until the earliest commit to retain based
> on the latest clean commits.
> Corner case to consider:
> A savepoint was added at, say, t3 and later removed, and the cleaner was
> never re-enabled. Even though archival would have been stopped at t3 (while the
> savepoint was present), once the savepoint is removed, if archival is executed,
> it could archive commit t3. That means the file versions tracked at t3 are still
> not yet cleaned by the cleaner.
> Reasoning:
> We are good here w.r.t. data consistency. Until the cleaner next runs, these
> older file versions might be exposed to the end user. But time travel queries
> are not intended for already cleaned up commits, so this is not an issue. None
> of snapshot, time travel, or incremental queries will run into issues, as they
> are not supposed to poll for t3.
> At any later point, if the cleaner is re-enabled, it will take care of cleaning
> up the file versions tracked at the t3 commit. Just that, for the interim
> period, some older file versions might still be exposed to readers.
>
> b. The trickier part is when replace commits are involved. Since the replace
> commit metadata in the active timeline is what ensures the replaced file groups
> are ignored for reads, before archiving it, the cleaner is expected to clean
> them up fully. But are there chances that this could go wrong?
> Corner case to consider: let's add onto the above scenario, where t3 has a
> savepoint, and t4 is a replace commit which replaced file groups tracked in t3.
> The cleaner will skip cleaning up files tracked by t3 (due to the presence of
> the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to
> retain will point to t6. Now say the savepoint for t3 is removed while the
> cleaner stays disabled. In this state of the timeline, if archival is executed
> (since t3's savepoint is removed), archival might archive t3 and t4.rc. This
> could lead to data duplicates, as both the replaced file groups and the new
> file groups from t4.rc would be exposed as valid file groups.
>
> In other words, to summarize the different scenarios:
> i. The replaced file group is never cleaned up:
>     ECTR (earliest commit to retain) is less than this.rc and we are good.
> ii. The replaced file group is cleaned up:
>     ECTR is greater than this.rc and it is good to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, full
> cleanup did not happen. After the savepoint is removed, when archival is
> executed, we should avoid archiving the rc of interest. This is the gap we do
> not account for as of now.
>
> We have 3 options to solve this.
> Option A:
> Let the savepoint deletion flow take care of cleaning up the files it tracks.
> Cons:
> Savepoint's responsibility is not to remove data files. So, from a
> single-responsibility standpoint, this may not be right. Also, this cleanup
> might need to do what a clean planner does, i.e. build the file system view,
> understand whether a file is already supposed to be cleaned up, and then only
> clean up those files. For example, a file group with only one file slice should
> not be cleaned up, and there are more scenarios like this.
>
> Option B:
> Since archival is the one which might cause data consistency issues, why not
> have archival do the cleanup.
>
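To make scenario iii concrete, a minimal sketch of the guard under discussion, 
assuming hypothetical names and plain-string instants (Hudi's archival planner 
actually works on timeline instants): archival must stop at whichever comes first, 
the earliest commit to retain or the earliest replace commit whose replaced file 
groups are not yet fully cleaned.

```scala
// Illustrative only, not Hudi's code. Hudi instant times are fixed-width
// timestamps, so the lexicographic minimum is the earliest instant.
def archivalBoundary(ectr: String, uncleanedReplaceCommits: Seq[String]): String =
  (ectr +: uncleanedReplaceCommits).min
```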

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Status: In Progress  (was: Open)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-06-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:

Sprint: 2024/06/03-16

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>

[I] [SUPPORT] [hudi]

2024-06-05 Thread via GitHub


zaminhassnain06 opened a new issue, #11403:
URL: https://github.com/apache/hudi/issues/11403

   Hi,
   Our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1 and also 
updating the required Spark and EMR versions. Our existing data sets (100s of 
TBs of data on S3) are written using Hudi 0.6.0.
   
   Hudi has come a long way since 0.6.0, and we are not sure how to move to 
0.12.1 directly.
   
   Could someone provide the steps for upgrading from 0.6.0 to 0.12.1?
   
   Do we have to rebuild our tables? We are concerned about this, as our 
tables have billions of records.
   
   Should we expect the following improvements after the upgrade:
   - faster upserts
   - column add/modify (schema evolution)
   - clustering
   - a possible solution for storing the history of updates performed on records
   
   Thanks,
   Zamin Hassnain





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151246128

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238)
 
   * 1e677e9b8b5d79cb23e85f2577407f9be840c762 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24242)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628627195


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -73,16 +84,27 @@ class SparkFileFormatInternalRowReaderContext(readerMaps: 
mutable.Map[Long, Part
   }
 }).asInstanceOf[ClosableIterator[InternalRow]]
 } else {
-  val schemaPairHashKey = generateSchemaPairHashKey(dataSchema, 
requiredSchema)
-  if (!readerMaps.contains(schemaPairHashKey)) {
-throw new IllegalStateException("schemas don't hash to a known reader")
-  }
-  new 
CloseableInternalRowIterator(readerMaps(schemaPairHashKey).apply(fileInfo))
+  // partition value is empty because the spark parquet reader will append 
the partition columns to
+  // each row if they are given. That is the only usage of the partition 
values in the reader.
+  val fileInfo = sparkAdapter.getSparkPartitionedFileUtils
+.createPartitionedFile(InternalRow.empty, filePath, start, length)
+  val (readSchema, readFilters) = getSchemaAndFiltersForRead(structType)
+  new CloseableInternalRowIterator(parquetFileReader.read(fileInfo,
+readSchema, StructType(Seq.empty), readFilters, 
storage.getConf.asInstanceOf[StorageConfiguration[Configuration]]))
 }
   }
 
-  private def generateSchemaPairHashKey(dataSchema: Schema, requestedSchema: 
Schema): Long = {
-dataSchema.hashCode() + requestedSchema.hashCode()
+  private def getSchemaAndFiltersForRead(structType: StructType): (StructType, 
Seq[Filter]) = {
+(getHasLogFiles, getNeedsBootstrapMerge, getUseRecordPosition) match {

Review Comment:
   The controlling flag looks incorrect: `shouldUseRecordPosition` controls the 
merging based on record positions from the log files, not whether to read 
record positions from the parquet file with the Spark 3.5 parquet reader (along 
with filter pushdown).  Only in Spark 3.5, when reading from the parquet base 
file, the reader should fetch the positions from the Spark parquet row index 
meta column, instead of counting the position inside Hudi.
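   For background, a hedged sketch of the Spark 3.5 mechanism referred to here, 
assuming the hidden `_metadata.row_index` field of Spark's parquet reader (the path 
is illustrative):

```scala
// Assumes Spark 3.5: surface the per-file parquet row index via the hidden
// _metadata struct instead of counting row positions inside Hudi.
import org.apache.spark.sql.functions.col
val withPos = spark.read.parquet("/tmp/base_files")
  .select(col("*"), col("_metadata.row_index").as("pos"))
```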






Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151233755

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238)
 
   * 1e677e9b8b5d79cb23e85f2577407f9be840c762 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628612267


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##
@@ -161,15 +167,14 @@ abstract class HoodieBaseHadoopFsRelationFactory(val 
sqlContext: SQLContext,
 val shouldExtractPartitionValueFromPath =
   
optParams.getOrElse(DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.key,
 
DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.defaultValue.toString).toBoolean
-val shouldUseBootstrapFastRead = 
optParams.getOrElse(DATA_QUERIES_ONLY.key(), "false").toBoolean
-
-shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath || 
shouldUseBootstrapFastRead
+shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath
   }
 
   protected lazy val mandatoryFieldsForMerging: Seq[String] =
 Seq(recordKeyField) ++ preCombineFieldOpt.map(Seq(_)).getOrElse(Seq())
 
-  protected lazy val shouldUseRecordPosition: Boolean = 
checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS)
+  //feature added in spark 3.5
+  protected lazy val shouldUseRecordPosition: Boolean = 
checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS) && 
HoodieSparkUtils.gteqSpark3_5

Review Comment:
   We can still merge deletes and updates based on record positions encoded in 
the log block headers regardless of Spark versions, correct? 








(hudi) branch asf-site updated: [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402)

2024-06-05 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 41d1021f8a7 [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue 
sync (#11402)
41d1021f8a7 is described below

commit 41d1021f8a70f9c2f2bdc049e514510b4ea1053e
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Jun 5 20:05:18 2024 -0500

[HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402)
---
 website/docs/syncing_aws_glue_data_catalog.md |  51 +-
 website/docs/syncing_metastore.md | 235 ++
 2 files changed, 176 insertions(+), 110 deletions(-)

diff --git a/website/docs/syncing_aws_glue_data_catalog.md 
b/website/docs/syncing_aws_glue_data_catalog.md
index e54c6d52887..b6f6c82a6c5 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -7,22 +7,61 @@ Hudi tables can sync to AWS Glue Data Catalog directly via 
AWS SDK. Piggyback on
 , `org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool` makes use of all the 
configurations that are taken by `HiveSyncTool`
 and send them to AWS Glue.
 
-### Configurations
+## Configurations
 
-There is no additional configuration for using `AwsGlueCatalogSyncTool`; you 
just need to set it as one of the sync tool
-classes for `HoodieStreamer` and everything configured as shown in [Sync to 
Hive Metastore](syncing_metastore) will
-be passed along.
+Most of the configurations for `AwsGlueCatalogSyncTool` are shared with 
`HiveSyncTool`. The example shown in 
+[Sync to Hive Metastore](syncing_metastore) can be used as is for syncing with 
the Glue Data Catalog, provided that the hive metastore
+URL (either JDBC or thrift URI) can be proxied to the Glue Data Catalog, which is 
usually done within an AWS EMR or Glue job environment.
+
+For Hudi streamer, users can set
 
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
 
-#### Running AWS Glue Catalog Sync for Spark DataSource
+For Spark data source writers, users can set
+
+```shell
+hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
+```
+
+### Avoid creating excessive versions
+
+Tables stored in Glue Data Catalog are versioned. By default, every Hudi 
commit triggers a sync operation if sync is enabled, regardless of whether there 
are relevant metadata changes.
+This can lead to too many versions being kept in the catalog and eventually fail 
the sync operation.
+
+Meta-sync can be made conditional: only sync when there is a schema change 
or partition change. This avoids creating
+excessive versions in the catalog. Users can enable it by setting 
+
+```
+hoodie.datasource.meta_sync.condition.sync=true
+```
+
+### Glue Data Catalog specific configs
+
+Sync to Glue Data Catalog can be optimized with other configs like
+
+```
+hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.partition_change_parallelism
+```
+
+[Partition 
indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) can 
also be used by setting
+
+```
+hoodie.datasource.meta.sync.glue.partition_index_fields.enable
+hoodie.datasource.meta.sync.glue.partition_index_fields
+```
+
+## Other references
+
+### Running AWS Glue Catalog Sync for Spark DataSource
 
 To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, 
you can use the options mentioned in the
 [AWS 
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
 
-#### Running AWS Glue Catalog Sync from EMR
+### Running AWS Glue Catalog Sync from EMR
 
 If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog 
as external metastore, you can simply run the sync from command line like below:
 
diff --git a/website/docs/syncing_metastore.md 
b/website/docs/syncing_metastore.md
index e39c5f39337..2aada772a6a 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -10,6 +10,118 @@ Hive metastore as well. This unlocks the capability to 
query Hudi tables not onl
 interactive query engines such as Presto and Trino. In this document, we will 
go through different ways to sync the Hudi
 table to Hive metastore.
 
+## Spark Data Source example
+
+Prerequisites: set up the hive metastore properly and configure the Spark 
installation to point to it by placing `hive-site.xml` under 
`$SPARK_HOME/conf`
+
+Assume that
+  - hiveserver2 is running at port 1
+  - metastore is running at port 9083
+
+Then start a spark-shell with Hudi spark bundle jar as a dependency (refer to 
Quickstart example)
+
+We can run the following script to create a sample hudi table and sync it to 
hive
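The script itself is truncated in this digest; a minimal sketch of such a script, 
with hypothetical table names and paths (the authoritative version is in the updated 
syncing_metastore.md):

```scala
// Hedged sketch: write a sample hudi table from spark-shell and sync it to a
// hive metastore at thrift://localhost:9083. Names and paths are illustrative.
import org.apache.spark.sql.SaveMode
spark.range(0, 10).selectExpr("id", "id as ts", "'2024-06-05' as ds")
  .write.format("hudi")
  .option("hoodie.table.name", "hudi_sample")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://localhost:9083")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", "hudi_sample")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi_sample")
```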

Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]

2024-06-05 Thread via GitHub


xushiyan merged PR #11402:
URL: https://github.com/apache/hudi/pull/11402





[jira] [Closed] (HUDI-6633) Add hms based sync to hudi website

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-6633.
---
Resolution: Fixed

> Add hms based sync to hudi website
> --
>
> Key: HUDI-6633
> URL: https://issues.apache.org/jira/browse/HUDI-6633
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Shiyan Xu
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> we should add hms based sync to our hive sync page 
> [https://hudi.apache.org/docs/syncing_metastore]
>  





[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-4967:

Status: Open  (was: Patch Available)

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page as well: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From 
> this release, if this config is not set and Hive sync is enabled, then 
> partition value extractor class will be *automatically inferred* on the basis 
> of number of partition fields and whether or not hive style partitioning is 
> enabled.
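A hedged example of the explicit opt-in described above; the extractor class comes 
from the description, the other values are placeholders:

```scala
// Sketch: pin the pre-0.12 default extractor explicitly when upgrading,
// so hive sync keeps extracting partition values the old way.
val hiveSyncOpts = Map(
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.partition_value_extractor" ->
    "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor"
)
```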





[jira] [Closed] (HUDI-4834) Update AWSGlueCatalog syncing page to add spark datasource example

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-4834.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Update AWSGlueCatalog syncing page to add spark datasource example
> --
>
> Key: HUDI-4834
> URL: https://issues.apache.org/jira/browse/HUDI-4834
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: documentation
> Fix For: 0.15.0, 1.0.0
>
>
> [https://hudi.apache.org/docs/next/syncing_aws_glue_data_catalog] this page 
> specifically talks about how to leverage this syncing mechanism via 
> Deltastreamer. We also need example for spark datasource here. 





[jira] [Closed] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-4967.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628560599


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java:
##
@@ -343,19 +310,19 @@ public Builder 
withRecordBuffer(HoodieFileGroupRecordBuffer recordBuffer)
 
 @Override
 public HoodieMergedLogRecordReader build() {
+  ValidationUtils.checkArgument(recordMerger != null);
+  ValidationUtils.checkArgument(recordBuffer != null);
+  ValidationUtils.checkArgument(readerContext != null);

Review Comment:
   Add error message to the validation.



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -285,8 +324,8 @@ protected Option merge(Option older, Map olderInfoMap,
*  1. A set of pre-specified keys exists.
*  2. The key of the record is not contained in the set.
*/
-  protected boolean shouldSkip(T record, String keyFieldName, boolean 
isFullKey, Set keys) {
-String recordKey = readerContext.getValue(record, readerSchema, 
keyFieldName).toString();
+  protected boolean shouldSkip(T record, String keyFieldName, boolean 
isFullKey, Set keys, Schema dataBlockSchema) {

Review Comment:
   Is `dataBlockSchema` the writer schema?  Rename it as `writerSchema`?



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -73,6 +77,75 @@ public static Schema convert(InternalSchema internalSchema, 
String name) {
 return buildAvroSchemaFromInternalSchema(internalSchema, name);
   }
 
+  public static InternalSchema pruneAvroSchemaToInternalSchema(Schema schema, 
InternalSchema originSchema) {

Review Comment:
   To clarify, is this only used for internal schema?  Does schema evolution 
incur record conversion between Row and Avro records (which should be avoided 
as much as possible)?



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -275,6 +311,9 @@ protected Option merge(Option older, Map olderInfoMap,
 
 if (mergedRecord.isPresent()
 && 
!mergedRecord.get().getLeft().isDelete(mergedRecord.get().getRight(), 
payloadProps)) {
+  if (!mergedRecord.get().getRight().equals(readerSchema)) {
+return Option.ofNullable((T) 
mergedRecord.get().getLeft().rewriteRecordWithNewSchema(mergedRecord.get().getRight(),
 null, readerSchema).getData());

Review Comment:
   Do partial updates need schema evolution handling like this?



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -242,7 +250,44 @@ protected Pair, Schema> 
getRecordsIterator(HoodieDataBlock d
 } else {
   blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext);
 }
-return Pair.of(blockRecordsIterator, dataBlock.getSchema());
+Option, Schema>> schemaEvolutionTransformerOpt =
+composeEvolvedSchemaTransformer(dataBlock);
+
+// In case when schema has been evolved original persisted records will 
have to be
+// transformed to adhere to the new schema
+Function transformer =
+schemaEvolutionTransformerOpt.map(Pair::getLeft)
+.orElse(Function.identity());
+
+Schema schema = schemaEvolutionTransformerOpt.map(Pair::getRight)
+.orElseGet(dataBlock::getSchema);
+
+return Pair.of(new CloseableMappingIterator<>(blockRecordsIterator, 
transformer), schema);
+  }
+
+  /**
+   * Get final Read Schema for support evolution.
+   * step1: find the fileSchema for current dataBlock.
+   * step2: determine whether fileSchema is compatible with the final read 
internalSchema.
+   * step3: merge fileSchema and read internalSchema to produce final read 
schema.
+   *
+   * @param dataBlock current processed block
+   * @return final read schema.
+   */
+  protected Option, Schema>> 
composeEvolvedSchemaTransformer(
+  HoodieDataBlock dataBlock) {
+if (internalSchema.isEmptySchema()) {
+  return Option.empty();
+}
+
+long currentInstantTime = 
Long.parseLong(dataBlock.getLogBlockHeader().get(INSTANT_TIME));
+InternalSchema fileSchema = 
InternalSchemaCache.searchSchemaAndCache(currentInstantTime,
+hoodieTableMetaClient, false);

Review Comment:
   @jonvex follow-up JIRA to track?






Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]

2024-06-05 Thread via GitHub


xushiyan commented on PR #11402:
URL: https://github.com/apache/hudi/pull/11402#issuecomment-2151191188

   
![screencapture-localhost-3000-docs-next-syncing-aws-glue-data-catalog-2024-06-05-19_51_01](https://github.com/apache/hudi/assets/2701446/ca644d33-870c-4a0e-9515-e4c647fb3646)
   





Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]

2024-06-05 Thread via GitHub


xushiyan commented on PR #11402:
URL: https://github.com/apache/hudi/pull/11402#issuecomment-2151190860

   
![screencapture-localhost-3000-docs-next-syncing-metastore-2024-06-05-19_52_22](https://github.com/apache/hudi/assets/2701446/81929e63-3831-45d1-8303-07f0139840b9)
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628598510


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -46,21 +49,27 @@ import scala.collection.mutable
  *
  * This uses Spark parquet reader to read parquet data files or parquet log 
blocks.
  *
- * @param readermaps our intention is to build the reader inside of 
getFileRecordIterator, but since it is called from
- *   the executor, we will need to port a bunch of the code 
from ParquetFileFormat for each spark version
- *   for now, we pass in a map of the different readers we 
expect to create
+ * @param parquetFileReader A reader that transforms a {@link PartitionedFile} 
to an iterator of
+ *{@link InternalRow}. This is required for reading 
the base file and
+ *not required for reading a file group with only log 
files.
+ * @param recordKeyColumn column name for the recordkey
+ * @param filters spark filters that might be pushed down into the reader
  */
-class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, 
PartitionedFile => Iterator[InternalRow]]) extends 
BaseSparkInternalRowReaderContext {
+class SparkFileFormatInternalRowReaderContext(parquetFileReader: 
SparkParquetReader,
+  recordKeyColumn: String,
+  filters: Seq[Filter]) extends 
BaseSparkInternalRowReaderContext {
   lazy val sparkAdapter = SparkAdapterSupport.sparkAdapter
   val deserializerMap: mutable.Map[Schema, HoodieAvroDeserializer] = 
mutable.Map()
+  lazy val recordKeyFilters: Seq[Filter] = filters.filter(f => 
f.references.exists(c => c.equalsIgnoreCase(recordKeyColumn)))

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-7833






[jira] [Created] (HUDI-7833) Validate that fg reader works with nested column as record key

2024-06-05 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7833:
-

 Summary: Validate that fg reader works with nested column as 
record key
 Key: HUDI-7833
 URL: https://issues.apache.org/jira/browse/HUDI-7833
 Project: Apache Hudi
  Issue Type: Task
Reporter: Jonathan Vexler


Ensure that fg reader works if the record key is a nested column
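A hedged sketch of the scenario to validate; the nested field name and paths are 
illustrative:

```scala
// Sketch: record key drawn from a nested column ("meta.id"); the filegroup
// reader should resolve the dotted path when extracting keys.
val df = spark.range(3)
  .selectExpr("named_struct('id', cast(id as string)) as meta", "id as ts")
df.write.format("hudi")
  .option("hoodie.table.name", "nested_key_tbl")
  .option("hoodie.datasource.write.recordkey.field", "meta.id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("overwrite")
  .save("/tmp/nested_key_tbl")
```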



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6230) Make hive sync aws support partition indexes

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-6230:

Fix Version/s: 0.15.0

> Make hive sync aws support partition indexes
> 
>
> Key: HUDI-6230
> URL: https://issues.apache.org/jira/browse/HUDI-6230
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Glue provides indexing features that speed up partition retrieval considerably. 
> So far this is not supported. Having a new hive-sync configuration to activate 
> the feature, and optionally specify which partition columns to index, would 
> be helpful.
> Also, this is an operation that should not only be done at table creation time, but 
> could be activated/deactivated at will.
>  
> https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index
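For reference, the corresponding meta-sync configs (also shown in the doc change 
above) could be set as follows; the indexed field names are placeholders:

```scala
// Hedged sketch: enable Glue partition indexes during meta sync.
// "ds,hh" stands in for the table's real partition columns.
val glueIndexOpts = Map(
  "hoodie.datasource.meta.sync.glue.partition_index_fields.enable" -> "true",
  "hoodie.datasource.meta.sync.glue.partition_index_fields" -> "ds,hh"
)
```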





Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11401:
URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151112254

   
   ## CI report:
   
   * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24241)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628543356


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -101,46 +121,150 @@ class 
SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part
   }
 
   override def mergeBootstrapReaders(skeletonFileIterator: 
ClosableIterator[InternalRow],
- dataFileIterator: 
ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: 
ClosableIterator[InternalRow],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
-
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: ClosableIterator[Any],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (getUseRecordPosition) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val javaSet = new java.util.HashSet[String]()
+  javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
javaSet))
+  //If we have log files, we will want to do position based merging with 
those as well,
+  //so leave the row index column at the end
+  val dataProjection = if (getHasLogFiles) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false
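
A compact, self-contained sketch of the position-based alignment idea used in
the diff above (generic types and names are illustrative, not the PR's):

// Align two position-sorted iterators, emitting pairs whose positions match;
// mirrors the skeleton/data alignment loop above.
def alignByPosition[A, B](left: Iterator[(A, Long)],
                          right: Iterator[(B, Long)]): Iterator[(A, B)] =
  new Iterator[(A, B)] {
    private var pending: Option[(A, B)] = advance()

    private def advance(): Option[(A, B)] = {
      if (!left.hasNext || !right.hasNext) return None
      var (l, lPos) = left.next()
      var (r, rPos) = right.next()
      while (lPos != rPos) {
        if (lPos > rPos) {
          if (!right.hasNext) return None
          val n = right.next(); r = n._1; rPos = n._2
        } else {
          if (!left.hasNext) return None
          val n = left.next(); l = n._1; lPos = n._2
        }
      }
      Some((l, r))
    }

    override def hasNext: Boolean = pending.isDefined
    override def next(): (A, B) = {
      val out = pending.getOrElse(throw new NoSuchElementException)
      pending = advance()
      out
    }
  }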

[jira] [Closed] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-1964.
---
Resolution: Duplicate

> Update guide around hive metastore and hive sync for hudi tables
> 
>
> Key: HUDI-1964
> URL: https://issues.apache.org/jira/browse/HUDI-1964
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Nishith Agarwal
>Assignee: Shiyan Xu
>Priority: Minor
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-1964:

Fix Version/s: 1.0.0

> Update guide around hive metastore and hive sync for hudi tables
> 
>
> Key: HUDI-1964
> URL: https://issues.apache.org/jira/browse/HUDI-1964
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Nishith Agarwal
>Assignee: Shiyan Xu
>Priority: Minor
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6633) Add hms based sync to hudi website

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-6633:

Fix Version/s: 1.0.0

> Add hms based sync to hudi website
> --
>
> Key: HUDI-6633
> URL: https://issues.apache.org/jira/browse/HUDI-6633
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Shiyan Xu
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> we should add hms based sync to our hive sync page 
> [https://hudi.apache.org/docs/syncing_metastore]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-851.
--
Fix Version/s: 1.0.0
   (was: 0.15.0)
   Resolution: Duplicate

> Add Documentation on partitioning data with examples and details on how to 
> sync to Hive
> ---
>
> Key: HUDI-851
> URL: https://issues.apache.org/jira/browse/HUDI-851
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: query-eng, user-support-issues
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11401:
URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151067173

   
   ## CI report:
   
   * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24241)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11401:
URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151058358

   
   ## CI report:
   
   * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (a1ba9728310 -> 44922f160bd)

2024-06-05 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from a1ba9728310 [HUDI-7414] Remove redundant base path config in BQ sync 
(#11395)
 add 44922f160bd [MINOR] Allow recreation of metrics instance for base path 
(#11400)

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hudi/metrics/Metrics.java |  1 +
 .../java/org/apache/hudi/metrics/TestMetrics.java  | 62 ++
 2 files changed, 63 insertions(+)
 create mode 100644 
hudi-common/src/test/java/org/apache/hudi/metrics/TestMetrics.java



Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]

2024-06-05 Thread via GitHub


yihua merged PR #11400:
URL: https://github.com/apache/hudi/pull/11400


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]

2024-06-05 Thread via GitHub


yihua commented on PR #11400:
URL: https://github.com/apache/hudi/pull/11400#issuecomment-2151045404

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/8e77a102-fefa-44d3-9c8d-366546204d28
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628436290


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -46,21 +49,27 @@ import scala.collection.mutable
  *
  * This uses Spark parquet reader to read parquet data files or parquet log 
blocks.
  *
- * @param readermaps our intention is to build the reader inside of 
getFileRecordIterator, but since it is called from
- *   the executor, we will need to port a bunch of the code 
from ParquetFileFormat for each spark version
- *   for now, we pass in a map of the different readers we 
expect to create
+ * @param parquetFileReader A reader that transforms a {@link PartitionedFile} 
to an iterator of
+ *{@link InternalRow}. This is required for reading 
the base file and
+ *not required for reading a file group with only log 
files.
+ * @param recordKeyColumn column name for the record key
+ * @param filters Spark filters that might be pushed down into the reader
  */
-class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, 
PartitionedFile => Iterator[InternalRow]]) extends 
BaseSparkInternalRowReaderContext {
+class SparkFileFormatInternalRowReaderContext(parquetFileReader: 
SparkParquetReader,
+  recordKeyColumn: String,
+  filters: Seq[Filter]) extends 
BaseSparkInternalRowReaderContext {
   lazy val sparkAdapter = SparkAdapterSupport.sparkAdapter
   val deserializerMap: mutable.Map[Schema, HoodieAvroDeserializer] = 
mutable.Map()
+  lazy val recordKeyFilters: Seq[Filter] = filters.filter(f => 
f.references.exists(c => c.equalsIgnoreCase(recordKeyColumn)))

Review Comment:
   @jonvex Could you create a follow-up ticket to validate this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]

2024-06-05 Thread via GitHub


jonvex opened a new pull request, #11401:
URL: https://github.com/apache/hudi/pull/11401

   ### Change Logs
   Use the file group reader in the CDC tests and run CI.
   ### Impact
   
   A step toward making CDC reading engine-agnostic.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7832) Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns from upstream.

2024-06-05 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-7832:


Assignee: Balaji Varadarajan

> Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns 
> from upstream. 
> -
>
> Key: HUDI-7832
> URL: https://issues.apache.org/jira/browse/HUDI-7832
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> Background : [https://hudi.apache.org/blog/2021/08/23/s3-events-source/]
> This Jira is to refactor the classes associated with this feature so that we 
> can allow users to extend functionalities such as adding more columns from 
> s3_meta_table to the s3_hudi_table. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7832) Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns from upstream.

2024-06-05 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-7832:


 Summary: Refactor Deltastreamer S3/GCP Events Source to allow 
adding auxiliary columns from upstream. 
 Key: HUDI-7832
 URL: https://issues.apache.org/jira/browse/HUDI-7832
 Project: Apache Hudi
  Issue Type: Improvement
  Components: deltastreamer
Reporter: Balaji Varadarajan
 Fix For: 0.15.0, 1.0.0


Background : [https://hudi.apache.org/blog/2021/08/23/s3-events-source/]

This Jira is to refactor the classes associated with this feature so that
users can extend its functionality, such as adding more columns from the
s3_meta_table to the s3_hudi_table.
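
A rough sketch of the direction (Spark API; the default column paths follow the
S3 event notification layout, and all names here are illustrative assumptions):

import org.apache.spark.sql.DataFrame

// Widen the projection over the meta table so user-requested auxiliary
// columns flow through to the target Hudi table alongside the defaults.
def selectWithAuxColumns(metaDf: DataFrame, auxColumns: Seq[String]): DataFrame = {
  val defaults = Seq("s3.bucket.name", "s3.object.key", "s3.object.size")
  metaDf.selectExpr((defaults ++ auxColumns).distinct: _*)
}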

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150997634

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11400:
URL: https://github.com/apache/hudi/pull/11400#issuecomment-2150977762

   
   ## CI report:
   
   * 8f7123807feaa88d95dcc289364e2b8f15b43553 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24239)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-7414.
---
Resolution: Fixed

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: nadine
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There was a jira issue filed where sarfaraz wanted to know more about
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set:
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it is not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}}
> is superfluous: I see it being set as a config, but never used anywhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-7414:

Fix Version/s: 1.0.0
   (was: 0.15.0)

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: nadine
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There was a jira issue filed where sarfaraz wanted to know more about
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set:
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it is not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}}
> is superfluous: I see it being set as a config, but never used anywhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11400:
URL: https://github.com/apache/hudi/pull/11400#issuecomment-2150916443

   
   ## CI report:
   
   * 8f7123807feaa88d95dcc289364e2b8f15b43553 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150916361

   
   ## CI report:
   
   * 7bc15adec04d8b680ed83b532803ceef350d51a6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24237)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150915206

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 11862a3bd3b84cb12b0abcf8a399d2bfb56870b3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24222)
 
   * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150904022

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 11862a3bd3b84cb12b0abcf8a399d2bfb56870b3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24222)
 
   * e710020df011ae0e9aac4284126dbc226533e6d5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Allow recreation of metrics instance for base path [hudi]

2024-06-05 Thread via GitHub


the-other-tim-brown opened a new pull request, #11400:
URL: https://github.com/apache/hudi/pull/11400

   ### Change Logs
   
   - Removes metrics entry from map when it is shutdown
   
   ### Impact
   
   Allows proper recreation of a metrics instance if it was previously shut
down. This can be required by users interacting with these libraries directly.
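   
   As a minimal sketch of the pattern this enables (hypothetical names; this is
   not the actual `Metrics` class):
   
   import java.util.concurrent.ConcurrentHashMap
   
   // Hypothetical registry keyed by base path: shutdown removes the entry,
   // so a later getOrCreate for the same base path builds a fresh instance.
   class MetricsLike private (basePath: String) {
     def shutdown(): Unit = MetricsLike.instances.remove(basePath)
   }
   
   object MetricsLike {
     private val instances = new ConcurrentHashMap[String, MetricsLike]()
     def getOrCreate(basePath: String): MetricsLike =
       instances.computeIfAbsent(basePath, p => new MetricsLike(p))
   }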
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150893311

   
   ## CI report:
   
   * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
 
   * 7bc15adec04d8b680ed83b532803ceef350d51a6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24237)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150893255

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   * ea23061e800c02c8814d50efddf303edad448be2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24236)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628362032


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -101,46 +121,150 @@ class 
SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part
   }
 
   override def mergeBootstrapReaders(skeletonFileIterator: 
ClosableIterator[InternalRow],
- dataFileIterator: 
ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: 
ClosableIterator[InternalRow],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
-
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: ClosableIterator[Any],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (getUseRecordPosition) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val javaSet = new java.util.HashSet[String]()
+  javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
javaSet))
+  //If we have log files, we will want to do position based merging with 
those as well,
+  //so leave the row index column at the end
+  val dataProjection = if (getHasLogFiles) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false

Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150809888

   
   ## CI report:
   
   * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
 
   * 7bc15adec04d8b680ed83b532803ceef350d51a6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150809795

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234)
 
   * ea23061e800c02c8814d50efddf303edad448be2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24236)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150796622

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234)
 
   * ea23061e800c02c8814d50efddf303edad448be2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150796674

   
   ## CI report:
   
   * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150795998

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24233)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150783095

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150782235

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24232)
 
   * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24233)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-05 Thread via GitHub


linliu-code commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1628272787


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -101,46 +121,150 @@ class 
SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part
   }
 
   override def mergeBootstrapReaders(skeletonFileIterator: 
ClosableIterator[InternalRow],
- dataFileIterator: 
ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = {
-doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]],
-  dataFileIterator.asInstanceOf[ClosableIterator[Any]])
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: 
ClosableIterator[InternalRow],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], 
skeletonRequiredSchema,
+  dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema)
   }
 
-  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], 
dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = {
-new ClosableIterator[Any] {
-  val combinedRow = new JoinedRow()
-
-  override def hasNext: Boolean = {
-//If the iterators are out of sync it is probably due to filter 
pushdown
-checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext,
-  "Bootstrap data-file iterator and skeleton-file iterator have to be 
in-sync!")
-dataFileIterator.hasNext && skeletonFileIterator.hasNext
+  protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any],
+ skeletonRequiredSchema: Schema,
+ dataFileIterator: ClosableIterator[Any],
+ dataRequiredSchema: Schema): 
ClosableIterator[InternalRow] = {
+if (getUseRecordPosition) {
+  assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, 
ROW_INDEX_TEMPORARY_COLUMN_NAME))
+  val javaSet = new java.util.HashSet[String]()
+  javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME)
+  val skeletonProjection = projectRecord(skeletonRequiredSchema,
+AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, 
javaSet))
+  //If we have log files, we will want to do position based merging with 
those as well,
+  //so leave the row index column at the end
+  val dataProjection = if (getHasLogFiles) {
+getIdentityProjection
+  } else {
+projectRecord(dataRequiredSchema,
+  AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet))
   }
 
-  override def next(): Any = {
-(skeletonFileIterator.next(), dataFileIterator.next()) match {
-  case (s: ColumnarBatch, d: ColumnarBatch) =>
-val numCols = s.numCols() + d.numCols()
-val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols)
-for (i <- 0 until numCols) {
-  if (i < s.numCols()) {
-vecs(i) = s.column(i)
+  //Always use internal row for positional merge because
+  //we need to iterate row by row when merging
+  new CachingIterator[InternalRow] {
+val combinedRow = new JoinedRow()
+
+//position column will always be at the end of the row
+private def getPos(row: InternalRow): Long = {
+  row.getLong(row.numFields-1)
+}
+
+private def getNextSkeleton: (InternalRow, Long) = {
+  val nextSkeletonRow = 
skeletonFileIterator.next().asInstanceOf[InternalRow]
+  (nextSkeletonRow, getPos(nextSkeletonRow))
+}
+
+private def getNextData: (InternalRow, Long) = {
+  val nextDataRow = dataFileIterator.next().asInstanceOf[InternalRow]
+  (nextDataRow, getPos(nextDataRow))
+}
+
+override def close(): Unit = {
+  skeletonFileIterator.close()
+  dataFileIterator.close()
+}
+
+override protected def doHasNext(): Boolean = {
+  if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) {
+false
+  } else {
+var nextSkeleton = getNextSkeleton
+var nextData = getNextData
+while (nextSkeleton._2 != nextData._2) {
+  if (nextSkeleton._2 > nextData._2) {
+if (!dataFileIterator.hasNext) {
+  return false
+} else {
+  nextData = getNextData
+}
   } else {
-vecs(i) = d.column(i - s.numCols())
+if (!skeletonFileIterator.hasNext) {
+  return false

Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150703254

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150703293

   
   ## CI report:
   
   * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150702446

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24232)
 
   * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11399:
URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150688904

   
   ## CI report:
   
   * 2e1f5f9da800d048b39b5f119038191b9f277396 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11398:
URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150688823

   
   ## CI report:
   
   * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150688074

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * c6d07ea56ebf1c7eaeb9306df8fe0dd366d72abe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24217)
 
   * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-7829) storage partition stats index does not take effect in data skipping

2024-06-05 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852528#comment-17852528
 ] 

Sagar Sumit commented on HUDI-7829:
---

Thanks for creating the issue. Will take a look.

> storage partition stats index does not take effect in data skipping
> -
>
> Key: HUDI-7829
> URL: https://issues.apache.org/jira/browse/HUDI-7829
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Priority: Major
> Attachments: image-2024-06-05-16-30-50-503.png, 
> image-2024-06-05-16-31-44-871.png, image-2024-06-05-16-32-02-293.png
>
>
> Partition stats do not take effect; the current implementation does not seem
> to achieve partition filtering.
> - first
> In this picture, I changed the UT filter to trigger the partition stats index.
> !image-2024-06-05-16-30-50-503.png!
> partition_stats does not save the fileName, so if the `CSI` logic is reused,
> it throws a null pointer on the group-by key.
> !image-2024-06-05-16-31-44-871.png!
> !image-2024-06-05-16-32-02-293.png!
> This also causes other indexes to be skipped.
>  * second
> I also have a question: I am not sure whether this PR prunes `partition`s like
> a physical partition column (i.e., uses another field's min/max to decide
> which physical partitions to list file slices from), or filters fileNames like
> `CSI` and `RLI` do.
> thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7829) storage partition stats index does not take effect in data skipping

2024-06-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7829:
-

Assignee: Sagar Sumit

> storage partition stats index does not take effect in data skipping
> -
>
> Key: HUDI-7829
> URL: https://issues.apache.org/jira/browse/HUDI-7829
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: Sagar Sumit
>Priority: Major
> Attachments: image-2024-06-05-16-30-50-503.png, 
> image-2024-06-05-16-31-44-871.png, image-2024-06-05-16-32-02-293.png
>
>
> Partition stats do not take effect; the current implementation does not seem
> to achieve partition filtering.
> - first
> In this picture, I changed the UT filter to trigger the partition stats index.
> !image-2024-06-05-16-30-50-503.png!
> partition_stats does not save the fileName, so if the `CSI` logic is reused,
> it throws a null pointer on the group-by key.
> !image-2024-06-05-16-31-44-871.png!
> !image-2024-06-05-16-32-02-293.png!
> This also causes other indexes to be skipped.
>  * second
> I also have a question: I am not sure whether this PR prunes `partition`s like
> a physical partition column (i.e., uses another field's min/max to decide
> which physical partitions to list file slices from), or filters fileNames like
> `CSI` and `RLI` do.
> thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7829) storage partition stats index does not take effect in data skipping

2024-06-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7829:
--
Fix Version/s: 1.0.0

> storage partition stats index does not take effect in data skipping
> -
>
> Key: HUDI-7829
> URL: https://issues.apache.org/jira/browse/HUDI-7829
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: image-2024-06-05-16-30-50-503.png, 
> image-2024-06-05-16-31-44-871.png, image-2024-06-05-16-32-02-293.png
>
>
> Partition stats do not take effect; the current implementation does not seem
> to achieve partition filtering.
> - first
> In this picture, I changed the UT filter to trigger the partition stats index.
> !image-2024-06-05-16-30-50-503.png!
> partition_stats does not save the fileName, so if the `CSI` logic is reused,
> it throws a null pointer on the group-by key.
> !image-2024-06-05-16-31-44-871.png!
> !image-2024-06-05-16-32-02-293.png!
> This also causes other indexes to be skipped.
>  * second
> I also have a question: I am not sure whether this PR prunes `partition`s like
> a physical partition column (i.e., uses another field's min/max to decide
> which physical partitions to list file slices from), or filters fileNames like
> `CSI` and `RLI` do.
> thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1628207181


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataLogRecordReader.java:
##
@@ -253,7 +253,7 @@ public HoodieMetadataLogRecordReader build() {
 }
 
 private boolean shouldUseMetadataMergedLogRecordScanner() {
-  return PARTITION_NAME_SECONDARY_INDEX.equals(partitionName);
+  return partitionName.startsWith(PARTITION_NAME_SECONDARY_INDEX_PREFIX);

Review Comment:
   note: this is the main fix in the latest commit. The issue was not caught in
testing previously because I was only asserting the count of records. I have
improved the test: now we assert the secondary index records as well as check
that file pruning happens. Please see `TestSecondaryIndexPruning`.
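   
   A tiny sketch of why the prefix check matters (values illustrative; the
   prefix stands in for PARTITION_NAME_SECONDARY_INDEX_PREFIX):
   
   // Each secondary index gets its own metadata partition under a shared
   // prefix, so equality against the bare prefix never matches a concrete
   // index partition, while startsWith matches all of them.
   val prefix = "secondary_index_"
   val partitionName = "secondary_index_idx_city"
   partitionName == prefix                    // false: would skip the merged scanner
   partitionName.startsWith(prefix)           // true: routes to the merged scanner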



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/SecondaryIndexTestBase.scala:
##
@@ -62,4 +69,54 @@ class SecondaryIndexTestBase extends 
HoodieSparkClientTestBase {
 cleanupResources()
   }
 
+  def verifyQueryPredicate(hudiOpts: Map[String, String], columnName: String): 
Unit = {
+mergedDfList = 
spark.read.format("hudi").options(hudiOpts).load(basePath).repartition(1).cache()
 :: mergedDfList
+val secondaryKey = mergedDfList.last.limit(1).collect().map(row => 
row.getAs(columnName).toString)
+val dataFilter = EqualTo(attribute(columnName), Literal(secondaryKey(0)))
+verifyFilePruning(hudiOpts, dataFilter)
+  }
+
+  private def attribute(partition: String): AttributeReference = {
+AttributeReference(partition, StringType, nullable = true)()
+  }
+
+
+  private def verifyFilePruning(opts: Map[String, String], dataFilter: 
Expression): Unit = {

Review Comment:
   note: the methods here and below help verify that file pruning happens
correctly, i.e. with data skipping enabled the filtered file count is less than
the number of data files in the latest snapshot, and with data skipping
disabled the two counts are equal.
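   
   As a sketch of the described invariant (hypothetical helper, not from the PR):
   
   // With data skipping enabled, pruning must strictly reduce the file count;
   // with it disabled, the counts must match.
   def assertPruning(filteredFileCount: Int, latestSnapshotFileCount: Int,
                     dataSkippingEnabled: Boolean): Unit =
     if (dataSkippingEnabled) assert(filteredFileCount < latestSnapshotFileCount)
     else assert(filteredFileCount == latestSnapshotFileCount)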



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1628207181


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataLogRecordReader.java:
##
@@ -253,7 +253,7 @@ public HoodieMetadataLogRecordReader build() {
 }
 
 private boolean shouldUseMetadataMergedLogRecordScanner() {
-  return PARTITION_NAME_SECONDARY_INDEX.equals(partitionName);
+  return partitionName.startsWith(PARTITION_NAME_SECONDARY_INDEX_PREFIX);

Review Comment:
   note to reviewer: this is the main fix in the latest commit. The issue was
not caught in testing previously because I was only asserting the count of
records. I have improved the test: now we assert the secondary index records as
well as check that file pruning happens. Please see `TestSecondaryIndexPruning`.



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/SecondaryIndexTestBase.scala:
##
@@ -62,4 +69,54 @@ class SecondaryIndexTestBase extends 
HoodieSparkClientTestBase {
 cleanupResources()
   }
 
+  def verifyQueryPredicate(hudiOpts: Map[String, String], columnName: String): 
Unit = {
+mergedDfList = 
spark.read.format("hudi").options(hudiOpts).load(basePath).repartition(1).cache()
 :: mergedDfList
+val secondaryKey = mergedDfList.last.limit(1).collect().map(row => 
row.getAs(columnName).toString)
+val dataFilter = EqualTo(attribute(columnName), Literal(secondaryKey(0)))
+verifyFilePruning(hudiOpts, dataFilter)
+  }
+
+  private def attribute(partition: String): AttributeReference = {
+AttributeReference(partition, StringType, nullable = true)()
+  }
+
+
+  private def verifyFilePruning(opts: Map[String, String], dataFilter: 
Expression): Unit = {

Review Comment:
   note to reviewer: the methods here and below help verify that file pruning
happens correctly, i.e. with data skipping enabled the filtered file count is
less than the number of data files in the latest snapshot, and with data
skipping disabled the two counts are equal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1628193853


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##
@@ -229,6 +231,13 @@ public void cancelAllJobs() {
     javaSparkContext.cancelAllJobs();
   }
 
+  @Override
+  public <I, O> O aggregate(HoodieData<I> data, O zeroValue, Functions.Function2<O, I, O> seqOp, Functions.Function2<O, O, O> combOp) {
+    Function2<O, I, O> seqOpFunc = seqOp::apply;
+    Function2<O, O, O> combOpFunc = combOp::apply;
+    return HoodieJavaRDD.getJavaRDD(data).aggregate(zeroValue, seqOpFunc, combOpFunc);

Review Comment:
   Please check the latest commit. I have made the changes and added a test where we create a secondary index for a field of long type.
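
   For readers, the seqOp/combOp contract mirrored in plain Scala (a sketch for intuition, not Hudi code; the engine-context method above delegates the same shape to `JavaRDD.aggregate`):

   ```scala
   // seqOp folds elements into a per-partition accumulator; combOp merges the
   // per-partition accumulators into the final result.
   val data = Seq(3L, 9L, 4L, 7L)
   val perPartition = data.grouped(2).map(_.foldLeft(Long.MinValue)(Math.max)) // seqOp
   val result = perPartition.foldLeft(Long.MinValue)(Math.max)                 // combOp
   assert(result == 9L)
   ```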



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-05 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1628190224


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexWithSql.scala:
##
@@ -95,4 +97,39 @@ class TestSecondaryIndexWithSql extends SecondaryIndexTestBase {
   private def checkAnswer(sql: String)(expects: Seq[Any]*): Unit = {
     assertResult(expects.map(row => Row(row: _*)).toArray.sortBy(_.toString()))(spark.sql(sql).collect().sortBy(_.toString()))
   }
+
+  @Test
+  def testSecondaryIndexWithInFilter(): Unit = {
+    if (HoodieSparkUtils.gteqSpark3_2) {
+      var hudiOpts = commonOpts
+      hudiOpts = hudiOpts + (
+        DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),
+        DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  record_key_col string,
+           |  not_record_key_col string,
+           |  partition_key_col string
+           |) using hudi
+           | options (
+           |  primaryKey ='record_key_col',
+           |  hoodie.metadata.enable = 'true',
+           |  hoodie.metadata.record.index.enable = 'true',
+           |  hoodie.datasource.write.recordkey.field = 'record_key_col',
+           |  hoodie.enable.data.skipping = 'true'
+           | )
+           | partitioned by(partition_key_col)
+           | location '$basePath'
+           """.stripMargin)
+      spark.sql(s"insert into $tableName values('row1', 'abc', 'p1')")

Review Comment:
   Fixed the issue in the latest commit.
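
   For context, the kind of query the test name points at (a hypothetical continuation; the quoted hunk ends before the assertion):

   ```scala
   // An IN filter on the secondary key column, which data skipping should
   // answer by pruning files via the secondary index.
   spark.sql(s"select record_key_col from $tableName where not_record_key_col in ('abc')").show()
   ```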



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]

2024-06-05 Thread via GitHub


jonvex opened a new pull request, #11399:
URL: https://github.com/apache/hudi/pull/11399

   ### Change Logs
   
   Testing hive 3 bootstrap read using the bundle validation setup
   
   ### Impact
   
   See if Hive 3 works as expected.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] using spark's observe feature on dataframes saved by hudi is stuck [hudi]

2024-06-05 Thread via GitHub


szingerpeter commented on issue #11367:
URL: https://github.com/apache/hudi/issues/11367#issuecomment-2150610226

   @ad1happy2go , thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]

2024-06-05 Thread via GitHub


jonvex opened a new pull request, #11398:
URL: https://github.com/apache/hudi/pull/11398

   ### Change Logs
   
   Testing hive 3 using the bundle validation setup
   
   ### Impact
   
   See if Hive 3 works as expected.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7831) Support secondary index reads using native HFile reader

2024-06-05 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7831:
-

 Summary: Support secondary index reads using native HFile reader
 Key: HUDI-7831
 URL: https://issues.apache.org/jira/browse/HUDI-7831
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-05 Thread via GitHub


wombatu-kun commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2150016743

   > Let me know if you prefer to address the `toString()` calls in this PR. Also, could you raise another PR against `branch-0.x` with the same changes?
   
   Hi, @yihua! I've made it in a separate commit; please review. If that is enough, I'll raise a PR against `branch-0.x`. Also, let me know if it's better to squash all changes into a single commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149867032

   
   ## CI report:
   
   * 0b9134e14a349ac70defc972dd67e464c0506ae1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24230)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]

2024-06-05 Thread via GitHub


soumilshah1995 commented on issue #11391:
URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149868601

   Added following packages 
   
   ```
   
   HUDI_VERSION = '0.14.0'
   SPARK_VERSION = '3.4'
   
   os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"
   
   SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION},com.amazonaws:dynamodb-lock-client:1.2.0,com.amazonaws:aws-java-sdk-dynamodb:1.12.735,com.amazonaws:aws-java-sdk-core:1.12.735,org.apache.hudi:hudi-aws-bundle:{HUDI_VERSION},org.apache.hudi:hudi-aws:{HUDI_VERSION} pyspark-shell"
   
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
       .config('className', 'org.apache.hudi') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .getOrCreate()
   
   ```
   
   
   
   # Error : org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   
   
   ```
   g.apache.hudi#hudi-aws-bundle added as a dependency
   org.apache.hudi#hudi-aws added as a dependency
   :: resolving dependencies :: 
org.apache.spark#spark-submit-parent-aa8d9c29-7056-4201-b20a-c5f73fac7ea9;1.0
confs: [default]
found org.apache.hudi#hudi-spark3.4-bundle_2.12;0.14.0 in spark-list
found com.amazonaws#dynamodb-lock-client;1.2.0 in central
found software.amazon.awssdk#dynamodb;2.20.8 in central
found software.amazon.awssdk#aws-json-protocol;2.20.8 in central
found software.amazon.awssdk#aws-core;2.20.8 in central
found software.amazon.awssdk#annotations;2.20.8 in central
found software.amazon.awssdk#regions;2.20.8 in central
found software.amazon.awssdk#utils;2.20.8 in central
found org.reactivestreams#reactive-streams;1.0.2 in central
found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
found software.amazon.awssdk#sdk-core;2.20.8 in central
found software.amazon.awssdk#http-client-spi;2.20.8 in central
found software.amazon.awssdk#metrics-spi;2.20.8 in central
found software.amazon.awssdk#endpoints-spi;2.20.8 in central
found software.amazon.awssdk#profiles;2.20.8 in central
found software.amazon.awssdk#json-utils;2.20.8 in central
found software.amazon.awssdk#third-party-jackson-core;2.20.8 in central
found software.amazon.awssdk#auth;2.20.8 in central
found software.amazon.eventstream#eventstream;1.0.1 in central
found software.amazon.awssdk#protocol-core;2.20.8 in central
found software.amazon.awssdk#apache-client;2.20.8 in central
found org.apache.httpcomponents#httpclient;4.5.13 in local-m2-cache
found org.apache.httpcomponents#httpcore;4.4.13 in local-m2-cache
found commons-logging#commons-logging;1.2 in local-m2-cache
found software.amazon.awssdk#netty-nio-client;2.20.8 in central
found io.netty#netty-codec-http2;4.1.86.Final in central
found io.netty#netty-common;4.1.86.Final in central
found io.netty#netty-buffer;4.1.86.Final in central
found io.netty#netty-transport;4.1.86.Final in central
found io.netty#netty-resolver;4.1.86.Final in central
found io.netty#netty-codec;4.1.86.Final in central
found io.netty#netty-transport-classes-epoll;4.1.86.Final in central
found io.netty#netty-transport-native-unix-common;4.1.86.Final in 
central
found com.amazonaws#aws-java-sdk-dynamodb;1.12.735 in central
found com.amazonaws#aws-java-sdk-s3;1.12.735 in central
found com.amazonaws#aws-java-sdk-kms;1.12.735 in central
found com.amazonaws#aws-java-sdk-core;1.12.735 in central
found commons-codec#commons-codec;1.15 in local-m2-cache
found com.fasterxml.jackson.core#jackson-databind;2.12.7.2 in central
found com.fasterxml.jackson.core#jackson-annotations;2.12.7 in 
local-m2-cache
found com.fasterxml.jackson.core#jackson-core;2.12.7 in local-m2-cache
found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 
in central
found joda-time#joda-time;2.12.7 in central
found com.amazonaws#jmespath-java;1.12.735 in central
found org.apache.hudi#hudi-aws-bundle;0.14.0 in central
found org.apache.hudi#hudi-common;0.14.0 in central
found org.openjdk.jol#jol-core;0.16 in local-m2-cache
found com.fasterxml.jackson.datatype#jackson-datatype-jsr310;2.10.0 in 
local-m2-cache
found com.github.ben-manes.caffeine#caffeine;2.9.1 in local-m2-cache
found org.checkerframework#checker-qual;3.10.0 in local-m2-cache
found com.google.errorprone#error_prone_annotati

Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11396:
URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149867215

   
   ## CI report:
   
   * d39943c1608d0a18e25e8b13f9bf6900c684253f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]

2024-06-05 Thread via GitHub


soumilshah1995 commented on issue #11391:
URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149831661

   # Code
   ```
   
   HUDI_VERSION = '0.14.0'
   SPARK_VERSION = '3.4'
   
   os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"
   
   AWS_JAR_FILES = f"org.apache.hudi:hudi-aws:{HUDI_VERSION},org.apache.hudi:hudi-aws-bundle:{HUDI_VERSION}"
   SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark3.4.1-bundle_2.12:{HUDI_VERSION},{AWS_JAR_FILES} pyspark-shell"
   
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
       .config('className', 'org.apache.hudi') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .getOrCreate()
   ```
   
   # Error 
   ```
   
python3 w1.py
   Imports loaded successfully.
   Warning: Ignoring non-Spark config property: className
   :: loading settings :: url = 
jar:file:/opt/anaconda3/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
   Ivy Default Cache set to: /Users/soumilshah/.ivy2/cache
   The jars for the packages stored in: /Users/soumilshah/.ivy2/jars
   org.apache.hudi#hudi-spark3.4.1-bundle_2.12 added as a dependency
   org.apache.hudi#hudi-aws added as a dependency
   org.apache.hudi#hudi-aws-bundle added as a dependency
   :: resolving dependencies :: 
org.apache.spark#spark-submit-parent-9c6c8274-f28f-4a73-b9e9-c27219acefce;1.0
confs: [default]
found org.apache.hudi#hudi-aws;0.14.0 in central
found org.apache.hudi#hudi-common;0.14.0 in central
found org.openjdk.jol#jol-core;0.16 in local-m2-cache
found com.fasterxml.jackson.core#jackson-annotations;2.10.0 in 
local-m2-cache
found com.fasterxml.jackson.core#jackson-databind;2.10.0 in 
local-m2-cache
found com.fasterxml.jackson.core#jackson-core;2.10.0 in local-m2-cache
found com.fasterxml.jackson.datatype#jackson-datatype-jsr310;2.10.0 in 
local-m2-cache
found com.github.ben-manes.caffeine#caffeine;2.9.1 in local-m2-cache
found org.checkerframework#checker-qual;3.10.0 in local-m2-cache
found com.google.errorprone#error_prone_annotations;2.5.1 in 
local-m2-cache
found org.apache.orc#orc-core;1.6.0 in local-m2-cache
found org.apache.orc#orc-shims;1.6.0 in local-m2-cache
found org.slf4j#slf4j-api;1.7.36 in local-m2-cache
found com.google.protobuf#protobuf-java;3.21.7 in local-m2-cache
found commons-lang#commons-lang;2.6 in local-m2-cache
found io.airlift#aircompressor;0.15 in local-m2-cache
found javax.xml.bind#jaxb-api;2.2.11 in local-m2-cache
found org.apache.hive#hive-storage-api;2.6.0 in local-m2-cache
found org.jetbrains#annotations;17.0.0 in local-m2-cache
found org.roaringbitmap#RoaringBitmap;0.9.47 in local-m2-cache
found org.apache.httpcomponents#fluent-hc;4.4.1 in local-m2-cache
found commons-logging#commons-logging;1.2 in local-m2-cache
found org.rocksdb#rocksdbjni;7.5.3 in local-m2-cache
found org.apache.hbase#hbase-client;2.4.9 in local-m2-cache
found org.apache.hbase.thirdparty#hbase-shaded-protobuf;3.5.1 in 
local-m2-cache
found org.apache.hbase#hbase-protocol-shaded;2.4.9 in local-m2-cache
found org.apache.yetus#audience-annotations;0.5.0 in local-m2-cache
found org.apache.hbase#hbase-protocol;2.4.9 in local-m2-cache
found javax.annotation#javax.annotation-api;1.2 in local-m2-cache
found commons-codec#commons-codec;1.13 in local-m2-cache
found commons-io#commons-io;2.11.0 in local-m2-cache
found org.apache.commons#commons-lang3;3.9 in local-m2-cache
found org.apache.hbase.thirdparty#hbase-shaded-miscellaneous;3.5.1 in 
local-m2-cache
found com.google.errorprone#error_prone_annotations;2.7.1 in 
local-m2-cache
found org.apache.hbase.thirdparty#hbase-shaded-netty;3.5.1 in 
local-m2-cache
found org.apache.zookeeper#zookeeper;3.5.7 in local-m2-cache
found org.apache.zookeeper#zookeeper-jute;3.5.7 in local-m2-cache
found io.netty#netty-handler;4.1.45.Final in local-m2-cache
found io.netty#netty-common;4.1.45.Final in local-m2-cache
found io.netty#netty-buffer;4.1.45.Final in local-m2-cache
found io.netty#netty-transport;4.1.45.Final in local-m2-cache
found io.netty#netty-resolver;4.1.45.Final in local-m2-cache
found io.netty#netty-codec;4.1.45.Final in local-m2-cache
found io.netty#netty-transport-native-epoll;4.1.45.Final in 
local-m2-cache
found io.netty#netty-transport-native-unix-common;4.1.45.Final in 
local-m2-cache
found org.apache.htrace#htrace-core4;4.2.0-i

Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]

2024-06-05 Thread via GitHub


prathit06 commented on issue #11397:
URL: https://github.com/apache/hudi/issues/11397#issuecomment-2149827641

   I have fixed this for our internal use & would like to contribute the same. Kindly assess & let me know if any other information is required.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]

2024-06-05 Thread via GitHub


prathit06 opened a new issue, #11397:
URL: https://github.com/apache/hudi/issues/11397

   **Describe the problem you faced** 
   - We used hive sync to sync tables to glue for hudi versions 0.8, 0.10.0, and 0.11.1. After some time we started using glue sync on hudi 0.11.1 and have recently migrated our workload to 0.13.1.
   - After migrating to 0.13.1 we have started facing errors wherein serde properties are missing from the table DDL, and when we try to read the table using spark we get the error below
   
   ```org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.```
   
   A clear and concise description of the problem.
   - Not able to read the hudi table from spark due to missing serDe properties after we migrated from 0.11.1 to 0.13.1 and changed from hive sync to glue sync
   
   **To Reproduce**
   - Create a table using hudi 0.8 with hive sync, upgrade hudi to 0.10, then to 0.11.1, add a new column & sync using hive sync.
   - Add a new column & sync the table using glue sync
   - Update to 0.13.1, add a new column & sync the table
   - Check the table DDL; the serde properties will be missing from the create DDL when checked in spark
   
   **Expected behaviour**
   Expected behaviour is for the serde properties to be present so spark can read the hudi table.
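
   A hedged workaround sketch until the fix lands (database, table, and location are placeholders; hive sync normally stores the base path in the `path` serde property that Spark's Hudi relation resolution reads):

   ```scala
   // Re-add the missing 'path' serde property so Spark can resolve the Hudi base path.
   spark.sql("ALTER TABLE my_db.my_hudi_table SET SERDEPROPERTIES ('path' = 's3://my-bucket/path/to/table')")
   ```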
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   
   ```
   24/05/30 02:41:24 ERROR DataSync: Got error in executing Data Sync job
   org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:77)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
   at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:270)
   at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:256)
   at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:275)
   at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:325)
   at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:311)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:75)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:388)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:424)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:256)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:422)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:370)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scal

Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]

2024-06-05 Thread via GitHub


soumilshah1995 closed issue #11391: [SUPPORT] Unable to Use DynamoDB Based Lock 
with Hudi PySpark Job Locally
URL: https://github.com/apache/hudi/issues/11391


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]

2024-06-05 Thread via GitHub


soumilshah1995 commented on issue #11391:
URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149620227

   oh let me try this and update the thread shortly 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11396:
URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149598890

   
   ## CI report:
   
   * 5dc3a94d9c3acb593b0c993e7ffa3b415e917774 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24229)
 
   * d39943c1608d0a18e25e8b13f9bf6900c684253f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24231)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149598779

   
   ## CI report:
   
   * 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
 
   * 0b9134e14a349ac70defc972dd67e464c0506ae1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24230)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149582706

   
   ## CI report:
   
   * 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
 
   * 0b9134e14a349ac70defc972dd67e464c0506ae1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]

2024-06-05 Thread via GitHub


hudi-bot commented on PR #11396:
URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149582843

   
   ## CI report:
   
   * 5dc3a94d9c3acb593b0c993e7ffa3b415e917774 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24229)
 
   * d39943c1608d0a18e25e8b13f9bf6900c684253f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


