Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]
prathit06 commented on issue #11397: URL: https://github.com/apache/hudi/issues/11397#issuecomment-2151534901 @danny0405 Please review: https://github.com/apache/hudi/pull/11404 Also, could you please create a Jira for this so I can add it to the PR? Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Fix missing serDe properties post migration from hiveSync to glueSync [hudi]
prathit06 opened a new pull request, #11404: URL: https://github.com/apache/hudi/pull/11404

### Change Logs

Add serDe properties to the table DDL if they are missing after migration from hive sync to glue sync. More context: https://github.com/apache/hudi/issues/11397

### Impact

NA

### Risk level (write none, low, medium or high below)

NA

### Documentation Update

NA

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
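The actual fix lives in the linked PR; as a rough illustration of the idea, the sketch below (plain Python, not the Hudi patch) fills in the serde entries a Glue table definition can lose during such a migration. The property names (`path` and the Parquet serde class) and the nested table layout mirror the Glue `get_table` response shape, but are assumptions here, not taken from the PR.

```python
# Illustrative only: ensure a Glue-style table dict carries the serde
# properties that hive sync would normally set, filling them in if missing.
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"

def ensure_serde_properties(table: dict, base_path: str) -> dict:
    sd = table.setdefault("StorageDescriptor", {})
    serde = sd.setdefault("SerdeInfo", {})
    # Only fill in what is absent; existing values are left untouched.
    serde.setdefault("SerializationLibrary", PARQUET_SERDE)
    params = serde.setdefault("Parameters", {})
    params.setdefault("path", base_path)  # the entry observed missing after migration
    return table

migrated = {"Name": "trips", "StorageDescriptor": {"SerdeInfo": {"Parameters": {}}}}
fixed = ensure_serde_properties(migrated, "s3://bucket/trips")
print(fixed["StorageDescriptor"]["SerdeInfo"]["Parameters"]["path"])
```

The `setdefault` pattern keeps the repair idempotent: running it against a table that already has correct serde properties changes nothing.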
Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]
ad1happy2go commented on issue #11391: URL: https://github.com/apache/hudi/issues/11391#issuecomment-2151489857 @soumilshah1995 Looks like AWS SDK bundle version conflicts with hudi-aws-bundle.
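The diagnosis above, a job classpath mixing AWS SDK generations with the one shaded into hudi-aws-bundle, can be pictured with a toy check. The jar names and the name-matching heuristics below are illustrative examples, not taken from the issue.

```python
import re

# Toy check: flag a jar list that mixes AWS SDK v1 artifacts (aws-java-sdk-*)
# with an SDK v2 bundle jar. Real conflicts are diagnosed from the actual
# classpath, this only sketches the shape of the problem.
def mixes_aws_sdk_generations(jars):
    v1 = any(re.search(r"aws-java-sdk", j) for j in jars)
    v2 = any(re.search(r"(?:^|/)bundle-2\.|awssdk", j) for j in jars)
    return v1 and v2

jars = ["hudi-aws-bundle-0.14.1.jar",
        "aws-java-sdk-bundle-1.12.262.jar",   # SDK v1 bundle
        "bundle-2.23.19.jar"]                 # SDK v2 bundle
print(mixes_aws_sdk_generations(jars))        # a likely classpath conflict
```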
[jira] [Created] (HUDI-7834) Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions
Balaji Varadarajan created HUDI-7834: Summary: Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions Key: HUDI-7834 URL: https://issues.apache.org/jira/browse/HUDI-7834 Project: Apache Hudi Issue Type: Improvement Reporter: Balaji Varadarajan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7834) Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions
[ https://issues.apache.org/jira/browse/HUDI-7834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-7834: Assignee: Balaji Varadarajan > Setup table versions to differentiate HUDI 0.16.x and 1.0-beta versions > --- > > Key: HUDI-7834 > URL: https://issues.apache.org/jira/browse/HUDI-7834 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major >
Re: [I] [SUPPORT] It failed to compile raw hudi src with error "oodieTableMetadataUtil.java:[189,7] no suitable method found for collect(java.util.stream.Collector
danny0405 commented on issue #5552: URL: https://github.com/apache/hudi/issues/5552#issuecomment-2151362100 We did have a fix for the Windows OS path with the special backslash; do you encounter any issues compiling on Windows OS?
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151300459 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 1e677e9b8b5d79cb23e85f2577407f9be840c762 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24242) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]
danny0405 commented on issue #11397: URL: https://github.com/apache/hudi/issues/11397#issuecomment-2151283036 > I have fixed this for our internal use & would like to contribute the same That's great! Can you share the patch with us?
Re: [I] [SUPPORT] [hudi]
danny0405 commented on issue #11403: URL: https://github.com/apache/hudi/issues/11403#issuecomment-2151279640 I would suggest you use 0.12.3 or 0.14.1; 0.12.1 still has some stability issues.
[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6787: Reviewers: Balaji Varadarajan > Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and > RealtimeCompactedRecordReader for Hive > -- > > Key: HUDI-6787 > URL: https://issues.apache.org/jira/browse/HUDI-6787 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Closed] (HUDI-7384) Implement writer path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7384. - Fix Version/s: 1.0.0 Resolution: Done > Implement writer path support for secondary index > - > > Key: HUDI-7384 > URL: https://issues.apache.org/jira/browse/HUDI-7384 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > # Basic initialization on an existing table > # Handle inserts/upserts
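The two writer-path items in the ticket, initialization on an existing table and keeping the index consistent on inserts/upserts, can be pictured with a toy in-memory secondary index (column value mapped to record keys). This is purely illustrative; the class and method names are invented here and are not Hudi's implementation.

```python
from collections import defaultdict

# Toy secondary index: built once over existing records, then maintained
# incrementally as records are inserted or upserted.
class SecondaryIndex:
    def __init__(self, records, column):
        self.column = column
        self.index = defaultdict(set)   # column value -> set of record keys
        self.current = {}               # record key -> its current column value
        for key, row in records.items():
            self.upsert(key, row)       # "initialization on an existing table"

    def upsert(self, key, row):
        old = self.current.get(key)
        if old is not None:
            self.index[old].discard(key)  # an update must drop the stale entry
        value = row[self.column]
        self.index[value].add(key)
        self.current[key] = value

    def lookup(self, value):
        return sorted(self.index[value])

idx = SecondaryIndex({"k1": {"city": "sf"}, "k2": {"city": "nyc"}}, "city")
idx.upsert("k1", {"city": "nyc"})   # upsert moves k1 from "sf" to "nyc"
print(idx.lookup("nyc"))            # ['k1', 'k2']
```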
[jira] [Updated] (HUDI-7405) Implement reader path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7405: -- Status: In Progress (was: Open) > Implement reader path support for secondary index > - > > Key: HUDI-7405 > URL: https://issues.apache.org/jira/browse/HUDI-7405 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major >
[jira] [Closed] (HUDI-7795) Fix loading of input splits from look up table reader
[ https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7795. --- Resolution: Fixed > Fix loading of input splits from look up table reader > - > > Key: HUDI-7795 > URL: https://issues.apache.org/jira/browse/HUDI-7795 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 >
[jira] [Updated] (HUDI-7405) Implement reader path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7405: -- Status: Patch Available (was: In Progress) > Implement reader path support for secondary index > - > > Key: HUDI-7405 > URL: https://issues.apache.org/jira/browse/HUDI-7405 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major >
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7779: Fix Version/s: 0.16.0 1.0.0 > Guarding archival to not archive unintended commits > --- > > Key: HUDI-7779 > URL: https://issues.apache.org/jira/browse/HUDI-7779 > Project: Apache Hudi > Issue Type: Bug > Components: archiving > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Fix For: 0.16.0, 1.0.0 > > > Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not make such unintended archival. > > The major gap we want to guard against is: if someone disabled the cleaner, archival should account for data consistency issues and bail out. > We have a base guarding condition, where archival stops at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for. > > a. Keeping aside replace commits, let's dive into specifics for regular commits and delta commits. > Say the user configured cleaning to retain 4 commits, and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit. > Corner case to consider: a savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have been stopped at t3 (while the savepoint was present), once the savepoint is removed, if archival is executed, it could archive commit t3. Which means the file versions tracked at t3 are still not yet cleaned by the cleaner. > Reasoning: we are good here w.r.t. data consistency. Until the cleaner runs next time, these older file versions might be exposed to the end-user. But time travel queries are not intended for already cleaned-up commits, so this is not an issue. None of snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. Just that, for the interim period, some older file versions might still be exposed to readers. > > b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean them up fully. But are there chances this could go wrong? > Corner case to consider: let's add onto the above scenario, where t3 has a savepoint, and t4 is a replace commit which replaced the file groups tracked in t3. > The cleaner will skip cleaning up the files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So the earliest commit to retain will be pointing to t6. Now say the savepoint for t3 is removed, but the cleaner is disabled. In this state of the timeline, if archival is executed (since t3's savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups. > > In other words, to summarize the different scenarios: > i. The replaced file group is never cleaned up: ECTR (earliest commit to retain) is less than this.rc and we are good. > ii. The replaced file group is cleaned up: ECTR is > this.rc and it is good to archive. > iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full clean up did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now. > > We have 3 options to solve this. > Option A: let the savepoint deletion flow take care of cleaning up the files it is tracking. > Cons: removing data files is not the savepoint's responsibility, so from a single-responsibility standpoint this may not be right. Also, this clean-up might need to do what a clean planner would actually be doing, i.e. build the file system view, understand whether a file is supposed to be cleaned up already, and only then clean up the files that qualify. For example, if a file group has only one file slice, it should not be cleaned up, and scenarios like this. > > Option B: since archival is the one which might cause data consistency issues, why not have archival do the clean up. >
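The guard the ticket argues for can be condensed into a toy predicate: a commit older than the earliest commit to retain (ECTR) is still not archivable if, for a replace commit, its replaced file groups are still on storage (scenario iii above). The function name, commit numbering, and `replaced_groups_cleaned` map are illustrative only, not Hudi code.

```python
# Toy model of the archival guard. Commits are integers (t3 -> 3, etc.).
# replaced_groups_cleaned[c] is False when c is a replace commit whose
# replaced file groups were skipped by the cleaner (e.g. due to a savepoint).
def can_archive(commit, ectr, replaced_groups_cleaned):
    if commit >= ectr:
        return False  # base guard: ECTR has not moved past this commit
    # extra guard closing the gap: a replace commit is archivable only once
    # the file groups it replaced have actually been cleaned up
    return replaced_groups_cleaned.get(commit, True)

# Scenario iii: ECTR = t6, but t4.rc's replaced file groups (tracked at the
# formerly savepointed t3) were never cleaned.
replaced_cleaned = {4: False}
print(can_archive(4, 6, replaced_cleaned))  # archiving t4.rc would expose duplicates
print(can_archive(2, 6, replaced_cleaned))  # ordinary commit well before ECTR
```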
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7779: Status: In Progress (was: Open) > Guarding archival to not archive unintended commits > --- > > Key: HUDI-7779 > URL: https://issues.apache.org/jira/browse/HUDI-7779 > Project: Apache Hudi > Issue Type: Bug > Components: archiving > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Fix For: 0.16.0, 1.0.0 >
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7779: Sprint: 2024/06/03-16 > Guarding archival to not archive unintended commits > --- > > Key: HUDI-7779 > URL: https://issues.apache.org/jira/browse/HUDI-7779 > Project: Apache Hudi > Issue Type: Bug > Components: archiving > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Fix For: 0.16.0, 1.0.0 >
[I] [SUPPORT] [hudi]
zaminhassnain06 opened a new issue, #11403: URL: https://github.com/apache/hudi/issues/11403 Hi, our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1, and also updating the required Spark and EMR versions. Our existing datasets (100s of TBs of data on S3) were written using Hudi 0.6.0. Hudi has come a long way since 0.6.0, and we are not sure how to use 0.12.1 directly. Could someone provide the steps for upgrading from 0.6.0 to 0.12.1? Do we have to rebuild our tables? We are most concerned about this, as our tables hold billions of records. Should we expect the following improvements after the upgrade:
- faster upserts
- column add/modify (schema evolution)
- clustering
- a possible solution for storing the history of updates performed on records

Thanks, Zamin Hassnain
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151246128 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238) * 1e677e9b8b5d79cb23e85f2577407f9be840c762 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24242) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628627195 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -73,16 +84,27 @@ class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part } }).asInstanceOf[ClosableIterator[InternalRow]] } else { - val schemaPairHashKey = generateSchemaPairHashKey(dataSchema, requiredSchema) - if (!readerMaps.contains(schemaPairHashKey)) { -throw new IllegalStateException("schemas don't hash to a known reader") - } - new CloseableInternalRowIterator(readerMaps(schemaPairHashKey).apply(fileInfo)) + // partition value is empty because the spark parquet reader will append the partition columns to + // each row if they are given. That is the only usage of the partition values in the reader. + val fileInfo = sparkAdapter.getSparkPartitionedFileUtils +.createPartitionedFile(InternalRow.empty, filePath, start, length) + val (readSchema, readFilters) = getSchemaAndFiltersForRead(structType) + new CloseableInternalRowIterator(parquetFileReader.read(fileInfo, +readSchema, StructType(Seq.empty), readFilters, storage.getConf.asInstanceOf[StorageConfiguration[Configuration]])) } } - private def generateSchemaPairHashKey(dataSchema: Schema, requestedSchema: Schema): Long = { -dataSchema.hashCode() + requestedSchema.hashCode() + private def getSchemaAndFiltersForRead(structType: StructType): (StructType, Seq[Filter]) = { +(getHasLogFiles, getNeedsBootstrapMerge, getUseRecordPosition) match { Review Comment: The controlling flag looks incorrect: `shouldUseRecordPosition` controls the merging based on record positions from the log files, not whether to read record positions from the parquet file with the Spark 3.5 parquet reader (along with filter pushdown). Only in Spark 3.5, when reading from the parquet base file, the reader should fetch the positions from the Spark parquet row index meta column, instead of counting the position inside Hudi.
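The mechanism the reviewer is discussing, pairing base-file rows with log-file changes by record position, can be sketched with a toy merge. This models the idea only (positions keying deletes and updates against base rows); it is not the filegroup reader's API, and in the real reader the positions come either from Spark 3.5's parquet row-index metadata column or from Hudi counting rows itself.

```python
# Toy position-based merge: base rows carry (position, row) pairs; log blocks
# contribute updates (position -> new row) and deletes (set of positions).
def merge_by_position(base_rows, updates, deletes):
    out = []
    for pos, row in base_rows:
        if pos in deletes:
            continue                      # a delete keyed by position drops the row
        out.append(updates.get(pos, row)) # an update keyed by position replaces it
    return out

base = [(0, "a0"), (1, "b0"), (2, "c0")]
merged = merge_by_position(base, updates={1: "b1"}, deletes={2})
print(merged)
```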
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2151233755 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238) * 1e677e9b8b5d79cb23e85f2577407f9be840c762 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628612267 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala: ## @@ -161,15 +167,14 @@ abstract class HoodieBaseHadoopFsRelationFactory(val sqlContext: SQLContext, val shouldExtractPartitionValueFromPath = optParams.getOrElse(DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.key, DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.defaultValue.toString).toBoolean -val shouldUseBootstrapFastRead = optParams.getOrElse(DATA_QUERIES_ONLY.key(), "false").toBoolean - -shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath || shouldUseBootstrapFastRead +shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath } protected lazy val mandatoryFieldsForMerging: Seq[String] = Seq(recordKeyField) ++ preCombineFieldOpt.map(Seq(_)).getOrElse(Seq()) - protected lazy val shouldUseRecordPosition: Boolean = checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS) + //feature added in spark 3.5 + protected lazy val shouldUseRecordPosition: Boolean = checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS) && HoodieSparkUtils.gteqSpark3_5 Review Comment: We can still merge deletes and updates based on record positions encoded in the log block headers regardless of Spark versions, correct?
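The reviewer's objection can be restated as a small decision table: the `MERGE_USE_RECORD_POSITIONS` config should enable position-based merging (positions come from log block headers) on any Spark version, while only sourcing positions from the parquet row-index column is Spark 3.5-specific. The function and flag names below are illustrative, not Hudi identifiers.

```python
# Toy decision table separating the two capabilities the review distinguishes.
def position_capabilities(use_positions_config: bool, spark_ge_3_5: bool) -> dict:
    return {
        # merging by positions from log block headers needs no Spark 3.5 feature
        "merge_with_log_block_positions": use_positions_config,
        # only reading positions via the parquet row-index column requires 3.5
        "positions_from_parquet_row_index": use_positions_config and spark_ge_3_5,
    }

caps = position_capabilities(use_positions_config=True, spark_ge_3_5=False)  # e.g. Spark 3.4
print(caps["merge_with_log_block_positions"], caps["positions_from_parquet_row_index"])
```

Gating the whole `shouldUseRecordPosition` flag on `gteqSpark3_5`, as the diff does, collapses both columns of this table into the second one, which is what the comment flags.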
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628612267 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala: ## @@ -161,15 +167,14 @@ abstract class HoodieBaseHadoopFsRelationFactory(val sqlContext: SQLContext, val shouldExtractPartitionValueFromPath = optParams.getOrElse(DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.key, DataSourceReadOptions.EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH.defaultValue.toString).toBoolean -val shouldUseBootstrapFastRead = optParams.getOrElse(DATA_QUERIES_ONLY.key(), "false").toBoolean - -shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath || shouldUseBootstrapFastRead +shouldOmitPartitionColumns || shouldExtractPartitionValueFromPath } protected lazy val mandatoryFieldsForMerging: Seq[String] = Seq(recordKeyField) ++ preCombineFieldOpt.map(Seq(_)).getOrElse(Seq()) - protected lazy val shouldUseRecordPosition: Boolean = checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS) + //feature added in spark 3.5 + protected lazy val shouldUseRecordPosition: Boolean = checkIfAConfigurationEnabled(HoodieReaderConfig.MERGE_USE_RECORD_POSITIONS) && HoodieSparkUtils.gteqSpark3_5 Review Comment: We can still merge deletes and updates based on record positions encoded in the log block headers, correct?
(hudi) branch asf-site updated: [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 41d1021f8a7 [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402) 41d1021f8a7 is described below commit 41d1021f8a70f9c2f2bdc049e514510b4ea1053e Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Wed Jun 5 20:05:18 2024 -0500 [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402) --- website/docs/syncing_aws_glue_data_catalog.md | 51 +- website/docs/syncing_metastore.md | 235 ++ 2 files changed, 176 insertions(+), 110 deletions(-) diff --git a/website/docs/syncing_aws_glue_data_catalog.md b/website/docs/syncing_aws_glue_data_catalog.md index e54c6d52887..b6f6c82a6c5 100644 --- a/website/docs/syncing_aws_glue_data_catalog.md +++ b/website/docs/syncing_aws_glue_data_catalog.md @@ -7,22 +7,61 @@ Hudi tables can sync to AWS Glue Data Catalog directly via AWS SDK. Piggybacking on `HiveSyncTool`, `org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool` makes use of all the configurations that are taken by `HiveSyncTool` and sends them to AWS Glue. -### Configurations +## Configurations -There is no additional configuration for using `AwsGlueCatalogSyncTool`; you just need to set it as one of the sync tool -classes for `HoodieStreamer` and everything configured as shown in [Sync to Hive Metastore](syncing_metastore) will -be passed along. +Most of the configurations for `AwsGlueCatalogSyncTool` are shared with `HiveSyncTool`. The example shown in +[Sync to Hive Metastore](syncing_metastore) can be used as is for sync with Glue Data Catalog, provided that the hive metastore +URL (either JDBC or thrift URI) can be proxied to Glue Data Catalog, which is usually done within an AWS EMR or Glue job environment. 
+ +For Hudi Streamer, users can set ```shell --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool ``` - Running AWS Glue Catalog Sync for Spark DataSource +For Spark data source writers, users can set + +```shell +hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool +``` + +### Avoid creating excessive versions + +Tables stored in Glue Data Catalog are versioned. By default, every Hudi commit triggers a sync operation if sync is enabled, regardless of whether there are relevant metadata changes. +This can lead to too many versions kept in the catalog and eventually failing the sync operation. + +Meta-sync can be set to conditional: only sync when there are schema changes or partition changes. This can avoid creating +excessive versions in the catalog. Users can enable it by setting + +``` +hoodie.datasource.meta_sync.condition.sync=true +``` + +### Glue Data Catalog specific configs + +Sync to Glue Data Catalog can be optimized with other configs like + +``` +hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism +hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism +hoodie.datasource.meta.sync.glue.partition_change_parallelism +``` + +[Partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) can also be used by setting + +``` +hoodie.datasource.meta.sync.glue.partition_index_fields.enable +hoodie.datasource.meta.sync.glue.partition_index_fields +``` + +## Other references + +### Running AWS Glue Catalog Sync for Spark DataSource To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, you can use the options mentioned in the [AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write) - Running AWS Glue Catalog Sync from EMR +### Running AWS Glue Catalog Sync from EMR If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog as external metastore, you can 
simply run the sync from the command line like below: diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md index e39c5f39337..2aada772a6a 100644 --- a/website/docs/syncing_metastore.md +++ b/website/docs/syncing_metastore.md @@ -10,6 +10,118 @@ Hive metastore as well. This unlocks the capability to query Hudi tables not onl interactive query engines such as Presto and Trino. In this document, we will go through different ways to sync the Hudi table to Hive metastore. +## Spark Data Source example + +Prerequisites: set up the Hive metastore properly and configure the Spark installation to point to the Hive metastore by placing `hive-site.xml` under `$SPARK_HOME/conf` + +Assume that + - hiveserver2 is running at port 1 + - metastore is running at port 9083 + +Then start a spark-shell with the Hudi Spark bundle jar as a dependency (refer to the Quickstart example) + +We can run the following script to create a sample Hudi table and sync it to Hive
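The "Avoid creating excessive versions" section in the docs diff above boils down to a simple guard: with `hoodie.datasource.meta_sync.condition.sync=true`, a commit that changed no schema and no partitions skips the catalog sync. The sketch below is illustrative only; the class and method names are hypothetical and this is not Hudi's actual implementation.

```java
// Illustrative sketch of conditional meta-sync semantics
// (hoodie.datasource.meta_sync.condition.sync). Hypothetical names,
// not Hudi's real classes.
public class ConditionalSyncSketch {

    /** Decide whether a catalog sync should run after a commit. */
    static boolean shouldSync(boolean conditionalSyncEnabled,
                              boolean schemaChanged,
                              boolean partitionsChanged) {
        if (!conditionalSyncEnabled) {
            // Default behavior: every commit triggers a sync, which can pile
            // up table versions in Glue Data Catalog.
            return true;
        }
        // Conditional mode: sync only on a relevant metadata change.
        return schemaChanged || partitionsChanged;
    }

    public static void main(String[] args) {
        System.out.println(shouldSync(false, false, false)); // true
        System.out.println(shouldSync(true, false, false));  // false
        System.out.println(shouldSync(true, true, false));   // true
    }
}
```

The design intent is that data-only commits (the common case for streaming ingestion) do not create a new Glue table version.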
Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]
xushiyan merged PR #11402: URL: https://github.com/apache/hudi/pull/11402
[jira] [Closed] (HUDI-6633) Add hms based sync to hudi website
[ https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-6633. --- Resolution: Fixed > Add hms based sync to hudi website > -- > > Key: HUDI-6633 > URL: https://issues.apache.org/jira/browse/HUDI-6633 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: sivabalan narayanan >Assignee: Shiyan Xu >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > we should add hms based sync to our hive sync page > [https://hudi.apache.org/docs/syncing_metastore] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-4967: Status: Open (was: Patch Available) > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-4834) Update AWSGlueCatalog syncing page to add spark datasource example
[ https://issues.apache.org/jira/browse/HUDI-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-4834. --- Fix Version/s: 1.0.0 Resolution: Fixed > Update AWSGlueCatalog syncing page to add spark datasource example > -- > > Key: HUDI-4834 > URL: https://issues.apache.org/jira/browse/HUDI-4834 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Bhavani Sudha >Assignee: Shiyan Xu >Priority: Minor > Labels: documentation > Fix For: 0.15.0, 1.0.0 > > > [https://hudi.apache.org/docs/next/syncing_aws_glue_data_catalog] this page > specifically talks about how to leverage this syncing mechanism via > Deltastreamer. We also need an example for the Spark datasource here.
[jira] [Closed] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-4967. --- Fix Version/s: 1.0.0 Resolution: Fixed > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628560599 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java: ## @@ -343,19 +310,19 @@ public Builder withRecordBuffer(HoodieFileGroupRecordBuffer recordBuffer) @Override public HoodieMergedLogRecordReader build() { + ValidationUtils.checkArgument(recordMerger != null); + ValidationUtils.checkArgument(recordBuffer != null); + ValidationUtils.checkArgument(readerContext != null); Review Comment: Add error message to the validation. ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -285,8 +324,8 @@ protected Option merge(Option older, Map olderInfoMap, * 1. A set of pre-specified keys exists. * 2. The key of the record is not contained in the set. */ - protected boolean shouldSkip(T record, String keyFieldName, boolean isFullKey, Set keys) { -String recordKey = readerContext.getValue(record, readerSchema, keyFieldName).toString(); + protected boolean shouldSkip(T record, String keyFieldName, boolean isFullKey, Set keys, Schema dataBlockSchema) { Review Comment: Is `dataBlockSchema` the writer schema? Rename it as `writerSchema`? ## hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java: ## @@ -73,6 +77,75 @@ public static Schema convert(InternalSchema internalSchema, String name) { return buildAvroSchemaFromInternalSchema(internalSchema, name); } + public static InternalSchema pruneAvroSchemaToInternalSchema(Schema schema, InternalSchema originSchema) { Review Comment: To clarify, is this only used for internal schema? Does schema evolution incur record conversion between Row and Avro records (which should be avoided as much as possible)? 
## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -275,6 +311,9 @@ protected Option merge(Option older, Map olderInfoMap, if (mergedRecord.isPresent() && !mergedRecord.get().getLeft().isDelete(mergedRecord.get().getRight(), payloadProps)) { + if (!mergedRecord.get().getRight().equals(readerSchema)) { +return Option.ofNullable((T) mergedRecord.get().getLeft().rewriteRecordWithNewSchema(mergedRecord.get().getRight(), null, readerSchema).getData()); Review Comment: Do partial updates need schema evolution handling like this? ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -242,7 +250,44 @@ protected Pair, Schema> getRecordsIterator(HoodieDataBlock d } else { blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext); } -return Pair.of(blockRecordsIterator, dataBlock.getSchema()); +Option, Schema>> schemaEvolutionTransformerOpt = +composeEvolvedSchemaTransformer(dataBlock); + +// In case when schema has been evolved original persisted records will have to be +// transformed to adhere to the new schema +Function transformer = +schemaEvolutionTransformerOpt.map(Pair::getLeft) +.orElse(Function.identity()); + +Schema schema = schemaEvolutionTransformerOpt.map(Pair::getRight) +.orElseGet(dataBlock::getSchema); + +return Pair.of(new CloseableMappingIterator<>(blockRecordsIterator, transformer), schema); + } + + /** + * Get final Read Schema for support evolution. + * step1: find the fileSchema for current dataBlock. + * step2: determine whether fileSchema is compatible with the final read internalSchema. + * step3: merge fileSchema and read internalSchema to produce final read schema. + * + * @param dataBlock current processed block + * @return final read schema. 
+ */ + protected Option, Schema>> composeEvolvedSchemaTransformer( + HoodieDataBlock dataBlock) { +if (internalSchema.isEmptySchema()) { + return Option.empty(); +} + +long currentInstantTime = Long.parseLong(dataBlock.getLogBlockHeader().get(INSTANT_TIME)); +InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(currentInstantTime, +hoodieTableMetaClient, false); Review Comment: @jonvex follow-up JIRA to track?
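The `composeEvolvedSchemaTransformer` logic under review returns an optional per-block transformer: identity when the log block was written with the current schema, otherwise a function rewriting each record to the evolved reader schema. Below is a much-simplified sketch of that idea, with schemas modeled as field-name lists and records as maps; the real code operates on Avro schemas and `InternalSchema`, so every name here is a stand-in, not Hudi's API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch of a per-block schema-evolution transformer.
// Schemas are field-name lists; records are maps. Hypothetical names.
public class SchemaEvolutionSketch {

    /** Build a transformer from writerFields to readerFields; identity if they match. */
    static Function<Map<String, Object>, Map<String, Object>> transformer(
            List<String> writerFields, List<String> readerFields) {
        if (writerFields.equals(readerFields)) {
            // Block was written with the current schema: no rewrite needed.
            return Function.identity();
        }
        return record -> {
            Map<String, Object> evolved = new LinkedHashMap<>();
            for (String field : readerFields) {
                // Carry over existing values; fields added after this block
                // was written default to null.
                evolved.put(field, record.get(field));
            }
            return evolved;
        };
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("id", 1);
        rec.put("name", "a");
        System.out.println(
            transformer(List.of("id", "name"), List.of("id", "name", "age")).apply(rec));
        // {id=1, name=a, age=null}
    }
}
```

In the real reader this transformer is wrapped around the block's record iterator (the `CloseableMappingIterator` in the diff) so rewriting happens lazily, one record at a time.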
Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]
xushiyan commented on PR #11402: URL: https://github.com/apache/hudi/pull/11402#issuecomment-2151191188 ![screencapture-localhost-3000-docs-next-syncing-aws-glue-data-catalog-2024-06-05-19_51_01](https://github.com/apache/hudi/assets/2701446/ca644d33-870c-4a0e-9515-e4c647fb3646)
Re: [PR] [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync [hudi]
xushiyan commented on PR #11402: URL: https://github.com/apache/hudi/pull/11402#issuecomment-2151190860 ![screencapture-localhost-3000-docs-next-syncing-metastore-2024-06-05-19_52_22](https://github.com/apache/hudi/assets/2701446/81929e63-3831-45d1-8303-07f0139840b9)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628598510 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -46,21 +49,27 @@ import scala.collection.mutable * * This uses Spark parquet reader to read parquet data files or parquet log blocks. * - * @param readermaps our intention is to build the reader inside of getFileRecordIterator, but since it is called from - * the executor, we will need to port a bunch of the code from ParquetFileFormat for each spark version - * for now, we pass in a map of the different readers we expect to create + * @param parquetFileReader A reader that transforms a {@link PartitionedFile} to an iterator of + *{@link InternalRow}. This is required for reading the base file and + *not required for reading a file group with only log files. + * @param recordKeyColumn column name for the recordkey + * @param filters spark filters that might be pushed down into the reader */ -class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, PartitionedFile => Iterator[InternalRow]]) extends BaseSparkInternalRowReaderContext { +class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetReader, + recordKeyColumn: String, + filters: Seq[Filter]) extends BaseSparkInternalRowReaderContext { lazy val sparkAdapter = SparkAdapterSupport.sparkAdapter val deserializerMap: mutable.Map[Schema, HoodieAvroDeserializer] = mutable.Map() + lazy val recordKeyFilters: Seq[Filter] = filters.filter(f => f.references.exists(c => c.equalsIgnoreCase(recordKeyColumn))) Review Comment: https://issues.apache.org/jira/browse/HUDI-7833 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Created] (HUDI-7833) Validate that fg reader works with nested column as record key
Jonathan Vexler created HUDI-7833: - Summary: Validate that fg reader works with nested column as record key Key: HUDI-7833 URL: https://issues.apache.org/jira/browse/HUDI-7833 Project: Apache Hudi Issue Type: Task Reporter: Jonathan Vexler Ensure that fg reader works if the record key is a nested column
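HUDI-7833 asks to validate the file group reader when the record key is a nested column. Resolving such a key amounts to walking a dotted path (e.g. `person.address.zip`) through the record. A minimal sketch follows, with records modeled as nested maps rather than engine-specific rows; the names are hypothetical, not Hudi's implementation.

```java
import java.util.Map;

// Illustrative sketch: resolving a nested column as a record key by
// walking a dotted path through a map-modeled record. Hypothetical names.
public class NestedKeySketch {

    /** Walk dottedPath through nested maps; null if the path does not resolve. */
    static Object resolve(Map<String, Object> record, String dottedPath) {
        Object current = record;
        for (String part : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                // Path runs deeper than the record structure.
                return null;
            }
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of(
            "person", Map.of("id", "k1", "address", Map.of("zip", "10001")));
        System.out.println(resolve(record, "person.address.zip")); // 10001
        System.out.println(resolve(record, "person.missing"));     // null
    }
}
```

The interesting validation cases are exactly the edge cases above: a path that dead-ends in a non-struct value, and a missing intermediate field.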
[jira] [Updated] (HUDI-6230) Make hive sync aws support partition indexes
[ https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-6230: Fix Version/s: 0.15.0 > Make hive sync aws support partition indexes > > > Key: HUDI-6230 > URL: https://issues.apache.org/jira/browse/HUDI-6230 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > glue provide indexing features, that speedup a lot partition retrieval > So far it is not supported. Having a new hive-sync configuration to activate > the feature, and optionally provide which partitions columns to index would > be helpful. > Also this is an operation that should not be done at creation table time, but > could be activated/deactivated at will > > https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]
hudi-bot commented on PR #11401: URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151112254 ## CI report: * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24241) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628543356 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -101,46 +121,150 @@ class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part } override def mergeBootstrapReaders(skeletonFileIterator: ClosableIterator[InternalRow], - dataFileIterator: ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = { -doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], - dataFileIterator.asInstanceOf[ClosableIterator[Any]]) + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[InternalRow], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], skeletonRequiredSchema, + dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema) } - protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = { -new ClosableIterator[Any] { - val combinedRow = new JoinedRow() - - override def hasNext: Boolean = { -//If the iterators are out of sync it is probably due to filter pushdown -checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext, - "Bootstrap data-file iterator and skeleton-file iterator have to be in-sync!") -dataFileIterator.hasNext && skeletonFileIterator.hasNext + protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[Any], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +if (getUseRecordPosition) { + assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + val javaSet = new 
java.util.HashSet[String]() + javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME) + val skeletonProjection = projectRecord(skeletonRequiredSchema, +AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, javaSet)) + //If we have log files, we will want to do position based merging with those as well, + //so leave the row index column at the end + val dataProjection = if (getHasLogFiles) { +getIdentityProjection + } else { +projectRecord(dataRequiredSchema, + AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet)) } - override def next(): Any = { -(skeletonFileIterator.next(), dataFileIterator.next()) match { - case (s: ColumnarBatch, d: ColumnarBatch) => -val numCols = s.numCols() + d.numCols() -val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols) -for (i <- 0 until numCols) { - if (i < s.numCols()) { -vecs(i) = s.column(i) + //Always use internal row for positional merge because + //we need to iterate row by row when merging + new CachingIterator[InternalRow] { +val combinedRow = new JoinedRow() + +//position column will always be at the end of the row +private def getPos(row: InternalRow): Long = { + row.getLong(row.numFields-1) +} + +private def getNextSkeleton: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +private def getNextData: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +override def close(): Unit = { + skeletonFileIterator.close() + dataFileIterator.close() +} + +override protected def doHasNext(): Boolean = { + if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) { +false + } else { +var nextSkeleton = getNextSkeleton +var nextData = getNextData +while (nextSkeleton._2 != nextData._2) { + if (nextSkeleton._2 > nextData._2) { +if (!dataFileIterator.hasNext) { + return false +} else { + nextData = getNextData 
+} } else { -vecs(i) = d.column(i - s.numCols()) +if (!skeletonFileIterator.hasNext) { + return fal
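The (truncated) Scala diff above implements position-based merging for bootstrap reads: the skeleton-file and data-file iterators are advanced until their row positions line up (rows filtered out on one side are skipped on the other), then the matching rows are joined. Below is a simplified, self-contained sketch of that merge, with rows modeled as `long[]` whose last slot holds the position; the real code works on Spark `InternalRow` (and `ColumnarBatch`), so all names here are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified sketch of position-based bootstrap merging. Rows are long[]
// with the row position in the last slot. Hypothetical names.
public class PositionalMergeSketch {

    static List<long[]> merge(Iterator<long[]> skeleton, Iterator<long[]> data) {
        List<long[]> out = new ArrayList<>();
        while (skeleton.hasNext() && data.hasNext()) {
            long[] s = skeleton.next();
            long[] d = data.next();
            long sPos = s[s.length - 1];
            long dPos = d[d.length - 1];
            // Advance the lagging side until positions match or a side is exhausted.
            while (sPos != dPos) {
                if (sPos > dPos) {
                    if (!data.hasNext()) return out;
                    d = data.next();
                    dPos = d[d.length - 1];
                } else {
                    if (!skeleton.hasNext()) return out;
                    s = skeleton.next();
                    sPos = s[s.length - 1];
                }
            }
            // Positions match: join skeleton columns with data columns,
            // dropping the position column from each side.
            long[] joined = new long[(s.length - 1) + (d.length - 1)];
            System.arraycopy(s, 0, joined, 0, s.length - 1);
            System.arraycopy(d, 0, joined, s.length - 1, d.length - 1);
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> merged = merge(
            Arrays.asList(new long[]{10, 0}, new long[]{20, 2}).iterator(),
            Arrays.asList(new long[]{100, 0}, new long[]{200, 1}, new long[]{300, 2}).iterator());
        for (long[] row : merged) {
            System.out.println(Arrays.toString(row)); // [10, 100] then [20, 300]
        }
    }
}
```

This is why positional merging requires the temporary row-index column to be the last field of both required schemas, as the assertions at the top of `doBootstrapMerge` enforce.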
[jira] [Closed] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-1964. --- Resolution: Duplicate > Update guide around hive metastore and hive sync for hudi tables > > > Key: HUDI-1964 > URL: https://issues.apache.org/jira/browse/HUDI-1964 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Nishith Agarwal >Assignee: Shiyan Xu >Priority: Minor > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-1964: Fix Version/s: 1.0.0 > Update guide around hive metastore and hive sync for hudi tables > > > Key: HUDI-1964 > URL: https://issues.apache.org/jira/browse/HUDI-1964 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Nishith Agarwal >Assignee: Shiyan Xu >Priority: Minor > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6633) Add hms based sync to hudi website
[ https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-6633: Fix Version/s: 1.0.0 > Add hms based sync to hudi website > -- > > Key: HUDI-6633 > URL: https://issues.apache.org/jira/browse/HUDI-6633 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: sivabalan narayanan >Assignee: Shiyan Xu >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > we should add hms based sync to our hive sync page > [https://hudi.apache.org/docs/syncing_metastore] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive
[ https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-851. -- Fix Version/s: 1.0.0 (was: 0.15.0) Resolution: Duplicate > Add Documentation on partitioning data with examples and details on how to > sync to Hive > --- > > Key: HUDI-851 > URL: https://issues.apache.org/jira/browse/HUDI-851 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Bhavani Sudha >Assignee: Shiyan Xu >Priority: Minor > Labels: query-eng, user-support-issues > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]
hudi-bot commented on PR #11401: URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151067173 ## CI report: * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24241) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]
hudi-bot commented on PR #11401: URL: https://github.com/apache/hudi/pull/11401#issuecomment-2151058358 ## CI report: * a4f3d9a64cc59f67bda1b9f9e045774b29213d2c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
(hudi) branch master updated (a1ba9728310 -> 44922f160bd)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from a1ba9728310 [HUDI-7414] Remove redundant base path config in BQ sync (#11395) add 44922f160bd [MINOR] Allow recreation of metrics instance for base path (#11400) No new revisions were added by this update. Summary of changes: .../main/java/org/apache/hudi/metrics/Metrics.java | 1 + .../java/org/apache/hudi/metrics/TestMetrics.java | 62 ++ 2 files changed, 63 insertions(+) create mode 100644 hudi-common/src/test/java/org/apache/hudi/metrics/TestMetrics.java
Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]
yihua merged PR #11400: URL: https://github.com/apache/hudi/pull/11400
Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]
yihua commented on PR #11400: URL: https://github.com/apache/hudi/pull/11400#issuecomment-2151045404 Azure CI is green. https://github.com/apache/hudi/assets/2497195/8e77a102-fefa-44d3-9c8d-366546204d28
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628436290 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -46,21 +49,27 @@ import scala.collection.mutable * * This uses Spark parquet reader to read parquet data files or parquet log blocks. * - * @param readermaps our intention is to build the reader inside of getFileRecordIterator, but since it is called from - * the executor, we will need to port a bunch of the code from ParquetFileFormat for each spark version - * for now, we pass in a map of the different readers we expect to create + * @param parquetFileReader A reader that transforms a {@link PartitionedFile} to an iterator of + *{@link InternalRow}. This is required for reading the base file and + *not required for reading a file group with only log files. + * @param recordKeyColumn column name for the recordkey + * @param filters spark filters that might be pushed down into the reader */ -class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, PartitionedFile => Iterator[InternalRow]]) extends BaseSparkInternalRowReaderContext { +class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetReader, + recordKeyColumn: String, + filters: Seq[Filter]) extends BaseSparkInternalRowReaderContext { lazy val sparkAdapter = SparkAdapterSupport.sparkAdapter val deserializerMap: mutable.Map[Schema, HoodieAvroDeserializer] = mutable.Map() + lazy val recordKeyFilters: Seq[Filter] = filters.filter(f => f.references.exists(c => c.equalsIgnoreCase(recordKeyColumn))) Review Comment: @jonvex Could you create a follow-up ticket to validate this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
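The `recordKeyFilters` line under review above simply keeps the pushed-down Spark filters whose column references include the record key column (compared case-insensitively). A minimal illustration of that selection logic, in Python with hypothetical filter dicts standing in for Spark `Filter` objects (not Hudi's or Spark's actual API):

```python
def record_key_filters(filters, record_key_column):
    """Keep only filters that reference the record key column (case-insensitive)."""
    return [
        f for f in filters
        if any(ref.lower() == record_key_column.lower() for ref in f["references"])
    ]

filters = [
    {"expr": "id = 5", "references": ["id"]},
    {"expr": "ts > 0", "references": ["ts"]},
    {"expr": "ID in (1, 2)", "references": ["ID"]},
]

# Keeps the two filters that touch "id"/"ID"; the "ts" filter is dropped.
print(record_key_filters(filters, "id"))
```

Only the record-key filters can safely be pushed into the base-file reader here, since other predicates could desynchronize the base-file and log-file iterators during merging.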
[PR] [HUDI-1234] DO NOT MERGE use fg reader in cdc test [hudi]
jonvex opened a new pull request, #11401: URL: https://github.com/apache/hudi/pull/11401 ### Change Logs use fg reader in cdc run ci ### Impact step in making cdc reading engine agnostic ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Assigned] (HUDI-7832) Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns from upstream.
[ https://issues.apache.org/jira/browse/HUDI-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-7832: Assignee: Balaji Varadarajan > Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns > from upstream. > - > > Key: HUDI-7832 > URL: https://issues.apache.org/jira/browse/HUDI-7832 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > Background : [https://hudi.apache.org/blog/2021/08/23/s3-events-source/] > This Jira is to refactor the classes associated with this feature so that we > can allow users to extend functionalities such as adding more columns from > s3_meta_table to the s3_hudi_table. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7832) Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns from upstream.
Balaji Varadarajan created HUDI-7832: Summary: Refactor Deltastreamer S3/GCP Events Source to allow adding auxiliary columns from upstream. Key: HUDI-7832 URL: https://issues.apache.org/jira/browse/HUDI-7832 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: Balaji Varadarajan Fix For: 0.15.0, 1.0.0 Background : [https://hudi.apache.org/blog/2021/08/23/s3-events-source/] This Jira is to refactor the classes associated with this feature so that we can allow users to extend functionalities such as adding more columns from s3_meta_table to the s3_hudi_table.
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150997634 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]
hudi-bot commented on PR #11400: URL: https://github.com/apache/hudi/pull/11400#issuecomment-2150977762 ## CI report: * 8f7123807feaa88d95dcc289364e2b8f15b43553 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24239)
[jira] [Closed] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-7414. --- Resolution: Fixed > Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs > --- > > Key: HUDI-7414 > URL: https://issues.apache.org/jira/browse/HUDI-7414 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: nadine >Assignee: Shiyan Xu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > There was a jira issue filed where sarfaraz wanted to know more about > `hoodie.gcp.bigquery.sync.base_path`. > In the BigQuerySyncConfig file, there is a config property set: > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103] > But it's not used anywhere else in the BigQuery code base. > However, I see > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124] > being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} > is superfluous: it is a config being set, but not used > anywhere.
[jira] [Updated] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-7414: Fix Version/s: 1.0.0 (was: 0.15.0) > Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs > --- > > Key: HUDI-7414 > URL: https://issues.apache.org/jira/browse/HUDI-7414 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: nadine >Assignee: Shiyan Xu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > There was a jira issue filed where sarfaraz wanted to know more about > `hoodie.gcp.bigquery.sync.base_path`. > In the BigQuerySyncConfig file, there is a config property set: > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103] > But it's not used anywhere else in the BigQuery code base. > However, I see > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124] > being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} > is superfluous: it is a config being set, but not used > anywhere.
Re: [PR] [MINOR] Allow recreation of metrics instance for base path [hudi]
hudi-bot commented on PR #11400: URL: https://github.com/apache/hudi/pull/11400#issuecomment-2150916443 ## CI report: * 8f7123807feaa88d95dcc289364e2b8f15b43553 UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150916361 ## CI report: * 7bc15adec04d8b680ed83b532803ceef350d51a6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24237)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150915206 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 11862a3bd3b84cb12b0abcf8a399d2bfb56870b3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24222) * e710020df011ae0e9aac4284126dbc226533e6d5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24238)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2150904022 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 11862a3bd3b84cb12b0abcf8a399d2bfb56870b3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24222) * e710020df011ae0e9aac4284126dbc226533e6d5 UNKNOWN
[PR] [MINOR] Allow recreation of metrics instance for base path [hudi]
the-other-tim-brown opened a new pull request, #11400: URL: https://github.com/apache/hudi/pull/11400 ### Change Logs - Removes metrics entry from map when it is shut down ### Impact Allows proper recreation of metrics instance if it was previously shut down. This can be required by users interacting with these libraries directly ### Risk level (write none, low medium or high below) None ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
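The change described in #11400 — removing the metrics entry from the shared map on shutdown so that a later lookup for the same base path builds a fresh instance — is a common registry pattern. A minimal sketch of that pattern in Python, with hypothetical names (this is not Hudi's actual `Metrics` API):

```python
class Metrics:
    # Registry of live instances, keyed by table base path.
    _instances = {}

    def __init__(self, base_path):
        self.base_path = base_path
        self.is_shutdown = False

    @classmethod
    def get_instance(cls, base_path):
        # Reuse a live instance for the same base path, else create one.
        if base_path not in cls._instances:
            cls._instances[base_path] = cls(base_path)
        return cls._instances[base_path]

    def shutdown(self):
        self.is_shutdown = True
        # The fix: drop the registry entry so a later get_instance()
        # returns a fresh instance instead of this shut-down one.
        Metrics._instances.pop(self.base_path, None)

m1 = Metrics.get_instance("s3://bucket/table")
m1.shutdown()
m2 = Metrics.get_instance("s3://bucket/table")
print(m2 is m1)  # False: a new instance after shutdown
```

Without the `pop` in `shutdown`, the registry would keep handing back the shut-down instance forever, which is the behavior this PR fixes.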
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150893311 ## CI report: * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235) * 7bc15adec04d8b680ed83b532803ceef350d51a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24237)
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150893255 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN * ea23061e800c02c8814d50efddf303edad448be2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24236)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628362032 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -101,46 +121,150 @@ class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part } override def mergeBootstrapReaders(skeletonFileIterator: ClosableIterator[InternalRow], - dataFileIterator: ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = { -doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], - dataFileIterator.asInstanceOf[ClosableIterator[Any]]) + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[InternalRow], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], skeletonRequiredSchema, + dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema) } - protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = { -new ClosableIterator[Any] { - val combinedRow = new JoinedRow() - - override def hasNext: Boolean = { -//If the iterators are out of sync it is probably due to filter pushdown -checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext, - "Bootstrap data-file iterator and skeleton-file iterator have to be in-sync!") -dataFileIterator.hasNext && skeletonFileIterator.hasNext + protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[Any], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +if (getUseRecordPosition) { + assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + val javaSet = new 
java.util.HashSet[String]() + javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME) + val skeletonProjection = projectRecord(skeletonRequiredSchema, +AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, javaSet)) + //If we have log files, we will want to do position based merging with those as well, + //so leave the row index column at the end + val dataProjection = if (getHasLogFiles) { +getIdentityProjection + } else { +projectRecord(dataRequiredSchema, + AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet)) } - override def next(): Any = { -(skeletonFileIterator.next(), dataFileIterator.next()) match { - case (s: ColumnarBatch, d: ColumnarBatch) => -val numCols = s.numCols() + d.numCols() -val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols) -for (i <- 0 until numCols) { - if (i < s.numCols()) { -vecs(i) = s.column(i) + //Always use internal row for positional merge because + //we need to iterate row by row when merging + new CachingIterator[InternalRow] { +val combinedRow = new JoinedRow() + +//position column will always be at the end of the row +private def getPos(row: InternalRow): Long = { + row.getLong(row.numFields-1) +} + +private def getNextSkeleton: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +private def getNextData: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +override def close(): Unit = { + skeletonFileIterator.close() + dataFileIterator.close() +} + +override protected def doHasNext(): Boolean = { + if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) { +false + } else { +var nextSkeleton = getNextSkeleton +var nextData = getNextData +while (nextSkeleton._2 != nextData._2) { + if (nextSkeleton._2 > nextData._2) { +if (!dataFileIterator.hasNext) { + return false +} else { + nextData = getNextData 
+} } else { -vecs(i) = d.column(i - s.numCols()) +if (!skeletonFileIterator.hasNext) { + return fa
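The positional bootstrap merge quoted in the review above walks the skeleton-file and data-file iterators in lockstep: whichever side has the smaller row index is advanced until the positions match, and the matching rows are then joined. A language-agnostic sketch of that merge loop, in Python with plain `(payload, position)` tuples standing in for `InternalRow`s (an illustration of the algorithm, not Hudi's implementation):

```python
def positional_merge(skeleton_iter, data_iter):
    """Join two position-sorted iterators on equal row positions."""
    skeleton_iter, data_iter = iter(skeleton_iter), iter(data_iter)
    merged = []
    try:
        s = next(skeleton_iter)
        d = next(data_iter)
        while True:
            if s[1] == d[1]:
                # Positions match: emit the joined row, advance both sides.
                merged.append((s[0], d[0], s[1]))
                s = next(skeleton_iter)
                d = next(data_iter)
            elif s[1] > d[1]:
                d = next(data_iter)      # data side is behind; skip ahead
            else:
                s = next(skeleton_iter)  # skeleton side is behind; skip ahead
    except StopIteration:
        # Either side exhausting ends the merge, as in doHasNext above.
        return merged

skeleton = [("k1", 0), ("k2", 2), ("k3", 5)]
data = [("a", 0), ("b", 1), ("c", 2), ("d", 5)]
print(positional_merge(skeleton, data))
# [('k1', 'a', 0), ('k2', 'c', 2), ('k3', 'd', 5)]
```

The skip-ahead branches are what make the merge robust to filter pushdown, which can remove rows from one side but not the other — the situation the old `checkState` assertion rejected.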
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150809888 ## CI report: * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235) * 7bc15adec04d8b680ed83b532803ceef350d51a6 UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150809795 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234) * ea23061e800c02c8814d50efddf303edad448be2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24236)
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150796622 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234) * ea23061e800c02c8814d50efddf303edad448be2 UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150796674 ## CI report: * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150795998 ## CI report: * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24233)
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150783095 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24234)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150782235 ## CI report: * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24232) * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24233)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
linliu-code commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1628272787 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -101,46 +121,150 @@ class SparkFileFormatInternalRowReaderContext(readerMaps: mutable.Map[Long, Part } override def mergeBootstrapReaders(skeletonFileIterator: ClosableIterator[InternalRow], - dataFileIterator: ClosableIterator[InternalRow]): ClosableIterator[InternalRow] = { -doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], - dataFileIterator.asInstanceOf[ClosableIterator[Any]]) + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[InternalRow], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +doBootstrapMerge(skeletonFileIterator.asInstanceOf[ClosableIterator[Any]], skeletonRequiredSchema, + dataFileIterator.asInstanceOf[ClosableIterator[Any]], dataRequiredSchema) } - protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], dataFileIterator: ClosableIterator[Any]): ClosableIterator[InternalRow] = { -new ClosableIterator[Any] { - val combinedRow = new JoinedRow() - - override def hasNext: Boolean = { -//If the iterators are out of sync it is probably due to filter pushdown -checkState(dataFileIterator.hasNext == skeletonFileIterator.hasNext, - "Bootstrap data-file iterator and skeleton-file iterator have to be in-sync!") -dataFileIterator.hasNext && skeletonFileIterator.hasNext + protected def doBootstrapMerge(skeletonFileIterator: ClosableIterator[Any], + skeletonRequiredSchema: Schema, + dataFileIterator: ClosableIterator[Any], + dataRequiredSchema: Schema): ClosableIterator[InternalRow] = { +if (getUseRecordPosition) { + assert(AvroSchemaUtils.containsFieldInSchema(skeletonRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + assert(AvroSchemaUtils.containsFieldInSchema(dataRequiredSchema, ROW_INDEX_TEMPORARY_COLUMN_NAME)) + val javaSet = new 
java.util.HashSet[String]() + javaSet.add(ROW_INDEX_TEMPORARY_COLUMN_NAME) + val skeletonProjection = projectRecord(skeletonRequiredSchema, +AvroSchemaUtils.removeFieldsFromSchema(skeletonRequiredSchema, javaSet)) + //If we have log files, we will want to do position based merging with those as well, + //so leave the row index column at the end + val dataProjection = if (getHasLogFiles) { +getIdentityProjection + } else { +projectRecord(dataRequiredSchema, + AvroSchemaUtils.removeFieldsFromSchema(dataRequiredSchema, javaSet)) } - override def next(): Any = { -(skeletonFileIterator.next(), dataFileIterator.next()) match { - case (s: ColumnarBatch, d: ColumnarBatch) => -val numCols = s.numCols() + d.numCols() -val vecs: Array[ColumnVector] = new Array[ColumnVector](numCols) -for (i <- 0 until numCols) { - if (i < s.numCols()) { -vecs(i) = s.column(i) + //Always use internal row for positional merge because + //we need to iterate row by row when merging + new CachingIterator[InternalRow] { +val combinedRow = new JoinedRow() + +//position column will always be at the end of the row +private def getPos(row: InternalRow): Long = { + row.getLong(row.numFields-1) +} + +private def getNextSkeleton: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +private def getNextData: (InternalRow, Long) = { + val nextSkeletonRow = skeletonFileIterator.next().asInstanceOf[InternalRow] + (nextSkeletonRow, getPos(nextSkeletonRow)) +} + +override def close(): Unit = { + skeletonFileIterator.close() + dataFileIterator.close() +} + +override protected def doHasNext(): Boolean = { + if (!dataFileIterator.hasNext || !skeletonFileIterator.hasNext) { +false + } else { +var nextSkeleton = getNextSkeleton +var nextData = getNextData +while (nextSkeleton._2 != nextData._2) { + if (nextSkeleton._2 > nextData._2) { +if (!dataFileIterator.hasNext) { + return false +} else { + nextData = getNextData 
+} } else { -vecs(i) = d.column(i - s.numCols()) +if (!skeletonFileIterator.hasNext) { + retu
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150703254 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN * 723b5a29eb4a7f872bb4436f8d6c612edf97a4d4 UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150703293 ## CI report: * 2e1f5f9da800d048b39b5f119038191b9f277396 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24235)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150702446 ## CI report: * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24232) * 8a9986ae4b8712c0e2e700aeb40a1e4c041fde0e UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
hudi-bot commented on PR #11399: URL: https://github.com/apache/hudi/pull/11399#issuecomment-2150688904 ## CI report: * 2e1f5f9da800d048b39b5f119038191b9f277396 UNKNOWN
Re: [PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
hudi-bot commented on PR #11398: URL: https://github.com/apache/hudi/pull/11398#issuecomment-2150688823 ## CI report: * da8e1320dc7b7e18a35319a32342f96eff646518 UNKNOWN
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2150688074 ## CI report: * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN * c6d07ea56ebf1c7eaeb9306df8fe0dd366d72abe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24217) * 7a0a21f67d6cfc5a17cd1e04abec99dfb6fd53f1 UNKNOWN
[jira] [Commented] (HUDI-7829) storage partition stats index does not take effect in data skipping
[ https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852528#comment-17852528 ] Sagar Sumit commented on HUDI-7829: Thanks for creating the issue. Will take a look. > storage partition stats index does not take effect in data skipping > Key: HUDI-7829 > URL: https://issues.apache.org/jira/browse/HUDI-7829 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Reporter: KnightChess > Priority: Major > Attachments: image-2024-06-05-16-30-50-503.png, image-2024-06-05-16-31-44-871.png, image-2024-06-05-16-32-02-293.png > > The partition stats index does not take effect; the current implementation does not seem to achieve partition filtering. > First: in the picture below, I changed the unit-test filter to trigger the partition stats index. !image-2024-06-05-16-30-50-503.png! partition_stats does not save fileName, so if the `CSI` logic is reused it throws a null pointer on the group-by key. !image-2024-06-05-16-31-44-871.png! !image-2024-06-05-16-32-02-293.png! This also causes the other indexes to be skipped. > Second: a question. I am not sure whether this PR is meant to prune partitions like a physical partition column (that is, use another field's min/max to decide which physical partitions to list file slices for), or to filter file names like `CSI` and `RLI`. Thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
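The failure mode the reporter describes can be sketched in a few lines. This is a hypothetical illustration, not Hudi's actual record schema: column-stats (`CSI`) records carry a file name that pruning groups on, while partition-stats records (as reported) do not, so reusing the column-stats grouping has no usable key.

```python
from collections import defaultdict

def group_by_file(stat_records):
    # Pruning logic of the CSI flavor: bucket stats by the file they describe.
    groups = defaultdict(list)
    for rec in stat_records:
        file_name = rec.get("fileName")
        if file_name is None:
            # Mirrors the reported null-pointer: there is no file to group on.
            raise ValueError("stat record has no fileName to group on")
        groups[file_name].append(rec)
    return groups

csi_records = [{"fileName": "f1.parquet", "min": 1, "max": 9}]
pstats_records = [{"partition": "p1", "min": 1, "max": 9}]  # no fileName field

group_by_file(csi_records)          # works: keyed by "f1.parquet"
try:
    group_by_file(pstats_records)   # reproduces the reported breakage
except ValueError as e:
    print(e)
```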
[jira] [Assigned] (HUDI-7829) storage partition stats index does not take effect in data skipping
[ https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-7829: Assignee: Sagar Sumit
[jira] [Updated] (HUDI-7829) storage partition stats index does not take effect in data skipping
[ https://issues.apache.org/jira/browse/HUDI-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7829: Fix Version/s: 1.0.0
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
codope commented on code in PR #11162: URL: https://github.com/apache/hudi/pull/11162#discussion_r1628207181

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataLogRecordReader.java:
@@ -253,7 +253,7 @@ public HoodieMetadataLogRecordReader build() {
   }

   private boolean shouldUseMetadataMergedLogRecordScanner() {
-    return PARTITION_NAME_SECONDARY_INDEX.equals(partitionName);
+    return partitionName.startsWith(PARTITION_NAME_SECONDARY_INDEX_PREFIX);

Review Comment: note: this is the main fix in the latest commit. The issue was not caught in testing previously because I was only asserting the count of records. I have improved the test: we now assert the secondary index records and also check that file pruning happens. Please see `TestSecondaryIndexPruning`.

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/SecondaryIndexTestBase.scala:
@@ -62,4 +69,54 @@ class SecondaryIndexTestBase extends HoodieSparkClientTestBase {
     cleanupResources()
   }

+  def verifyQueryPredicate(hudiOpts: Map[String, String], columnName: String): Unit = {
+    mergedDfList = spark.read.format("hudi").options(hudiOpts).load(basePath).repartition(1).cache() :: mergedDfList
+    val secondaryKey = mergedDfList.last.limit(1).collect().map(row => row.getAs(columnName).toString)
+    val dataFilter = EqualTo(attribute(columnName), Literal(secondaryKey(0)))
+    verifyFilePruning(hudiOpts, dataFilter)
+  }
+
+  private def attribute(partition: String): AttributeReference = {
+    AttributeReference(partition, StringType, nullable = true)()
+  }
+
+  private def verifyFilePruning(opts: Map[String, String], dataFilter: Expression): Unit = {

Review Comment: note: this method and the one below just help verify that file pruning happens correctly, i.e. with data skipping enabled the filtered file count is less than the number of data files in the latest snapshot, and with data skipping disabled they should be equal.
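The one-line fix in the review above can be illustrated with a small pure-Python sketch (the prefix value here is an assumption for illustration): metadata-table partitions for secondary indexes carry an index-specific suffix, so an exact name comparison misses them while a prefix check matches.

```python
# Assumed prefix for secondary index partitions in the metadata table.
PARTITION_NAME_SECONDARY_INDEX_PREFIX = "secondary_index_"

def should_use_merged_scanner_old(partition_name: str) -> bool:
    # Old behavior: only an exact match on the bare prefix qualifies.
    return partition_name == PARTITION_NAME_SECONDARY_INDEX_PREFIX

def should_use_merged_scanner_new(partition_name: str) -> bool:
    # New behavior: any partition whose name starts with the prefix qualifies.
    return partition_name.startswith(PARTITION_NAME_SECONDARY_INDEX_PREFIX)

# A partition for a hypothetical index named "idx_city" is missed by the old check:
p = "secondary_index_idx_city"
print(should_use_merged_scanner_old(p))  # False
print(should_use_merged_scanner_new(p))  # True
```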
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
codope commented on code in PR #11162: URL: https://github.com/apache/hudi/pull/11162#discussion_r1628193853

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
@@ -229,6 +231,13 @@ public void cancelAllJobs() {
     javaSparkContext.cancelAllJobs();
   }

+  @Override
+  public <I, O> O aggregate(HoodieData<I> data, O zeroValue, Functions.Function2<O, I, O> seqOp, Functions.Function2<O, O, O> combOp) {
+    Function2<O, I, O> seqOpFunc = seqOp::apply;
+    Function2<O, O, O> combOpFunc = combOp::apply;
+    return HoodieJavaRDD.getJavaRDD(data).aggregate(zeroValue, seqOpFunc, combOpFunc);

Review Comment: Please check the latest commit. I have made the changes and added a test where we create a secondary index for a field of long type.
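The `aggregate` override above delegates to Spark's two-phase `RDD.aggregate`: `seqOp` folds each element into a per-partition accumulator starting from `zeroValue`, and `combOp` merges the per-partition accumulators. A minimal pure-Python sketch of those semantics (partitions modeled as plain lists):

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")  # element type
A = TypeVar("A")  # accumulator type

def aggregate(partitions: List[List[T]], zero: A,
              seq_op: Callable[[A, T], A],
              comb_op: Callable[[A, A], A]) -> A:
    # Phase 1: fold each partition's elements into an accumulator with seq_op.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Phase 2: merge the per-partition accumulators with comb_op.
    return reduce(comb_op, partials, zero)

# Summing values of a long-typed field across two "partitions":
total = aggregate([[1, 2, 3], [4, 5]], 0, lambda acc, x: acc + x, lambda a, b: a + b)
print(total)  # 15
```

Note that `zero` is reused for every partition, so (as with Spark) it should be an immutable or freshly-copied value.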
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
codope commented on code in PR #11162: URL: https://github.com/apache/hudi/pull/11162#discussion_r1628190224

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexWithSql.scala:
@@ -95,4 +97,39 @@ class TestSecondaryIndexWithSql extends SecondaryIndexTestBase {
   private def checkAnswer(sql: String)(expects: Seq[Any]*): Unit = {
     assertResult(expects.map(row => Row(row: _*)).toArray.sortBy(_.toString()))(spark.sql(sql).collect().sortBy(_.toString()))
   }
+
+  @Test
+  def testSecondaryIndexWithInFilter(): Unit = {
+    if (HoodieSparkUtils.gteqSpark3_2) {
+      var hudiOpts = commonOpts
+      hudiOpts = hudiOpts + (
+        DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),
+        DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  record_key_col string,
+           |  not_record_key_col string,
+           |  partition_key_col string
+           |) using hudi
+           | options (
+           |  primaryKey ='record_key_col',
+           |  hoodie.metadata.enable = 'true',
+           |  hoodie.metadata.record.index.enable = 'true',
+           |  hoodie.datasource.write.recordkey.field = 'record_key_col',
+           |  hoodie.enable.data.skipping = 'true'
+           | )
+           | partitioned by(partition_key_col)
+           | location '$basePath'
+         """.stripMargin)
+      spark.sql(s"insert into $tableName values('row1', 'abc', 'p1')")

Review Comment: Fixed the issue in the latest commit.
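The test above exercises an `IN` filter against a secondary index. As a rough sketch of the idea (the index layout below is an assumption for illustration, not Hudi's actual storage format), serving an `IN` predicate amounts to unioning the record keys posted under each listed secondary key:

```python
# Hypothetical secondary index: secondary key -> set of record keys carrying it.
secondary_index = {
    "abc": {"row1"},
    "def": {"row2", "row3"},
}

def candidates_for_in(values):
    # An IN list is answered by unioning the posting sets of its members;
    # only the resulting record keys need to be looked up in data files.
    keys = set()
    for v in values:
        keys |= secondary_index.get(v, set())
    return keys

print(sorted(candidates_for_in(["abc", "def"])))  # ['row1', 'row2', 'row3']
```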
[PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions bootstrap [hudi]
jonvex opened a new pull request, #11399: URL: https://github.com/apache/hudi/pull/11399 ### Change Logs Testing hive 3 bootstrap read using the bundle validation setup ### Impact see if hive 3 works as expected ### Risk level (write none, low medium or high below) none ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [I] [SUPPORT] using spark's observe feature on dataframes saved by hudi is stuck [hudi]
szingerpeter commented on issue #11367: URL: https://github.com/apache/hudi/issues/11367#issuecomment-2150610226 @ad1happy2go , thank you!
[PR] [HUDI-6787] DO NOT MERGE. Test hive3 in ghactions [hudi]
jonvex opened a new pull request, #11398: URL: https://github.com/apache/hudi/pull/11398 ### Change Logs Testing hive 3 using the bundle validation setup ### Impact see if hive 3 works as expected ### Risk level (write none, low medium or high below) none ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7831) Support secondary index reads using native HFile reader
Sagar Sumit created HUDI-7831: Summary: Support secondary index reads using native HFile reader Key: HUDI-7831 URL: https://issues.apache.org/jira/browse/HUDI-7831 Project: Apache Hudi Issue Type: Improvement Reporter: Sagar Sumit Fix For: 1.0.0
Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]
wombatu-kun commented on PR #11385: URL: https://github.com/apache/hudi/pull/11385#issuecomment-2150016743 > Let me know if you prefer to address the `toString()` calls in this PR. Also, could you raise another PR against `branch-0.x` with the same changes? Hi, @yihua! I've made it in a separate commit; please review it. If that is enough, I'll raise a PR against `branch-0.x`. Also, let me know if it's better to squash all changes into a single commit.
Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]
hudi-bot commented on PR #11385: URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149867032 ## CI report: * 0b9134e14a349ac70defc972dd67e464c0506ae1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24230)
Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]
soumilshah1995 commented on issue #11391: URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149868601 Added the following packages:
```
HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION},com.amazonaws:dynamodb-lock-client:1.2.0,com.amazonaws:aws-java-sdk-dynamodb:1.12.735,com.amazonaws:aws-java-sdk-core:1.12.735,org.apache.hudi:hudi-aws-bundle:{HUDI_VERSION},org.apache.hudi:hudi-aws:{HUDI_VERSION} pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
```
# Error: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
```
org.apache.hudi#hudi-aws-bundle added as a dependency org.apache.hudi#hudi-aws added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-aa8d9c29-7056-4201-b20a-c5f73fac7ea9;1.0 confs: [default] found org.apache.hudi#hudi-spark3.4-bundle_2.12;0.14.0 in spark-list found com.amazonaws#dynamodb-lock-client;1.2.0 in central found software.amazon.awssdk#dynamodb;2.20.8 in central found software.amazon.awssdk#aws-json-protocol;2.20.8 in central found software.amazon.awssdk#aws-core;2.20.8 in central found software.amazon.awssdk#annotations;2.20.8 in central found software.amazon.awssdk#regions;2.20.8 in central found software.amazon.awssdk#utils;2.20.8 in central found org.reactivestreams#reactive-streams;1.0.2 in central found org.slf4j#slf4j-api;1.7.30 in local-m2-cache found software.amazon.awssdk#sdk-core;2.20.8 in central
found software.amazon.awssdk#http-client-spi;2.20.8 in central found software.amazon.awssdk#metrics-spi;2.20.8 in central found software.amazon.awssdk#endpoints-spi;2.20.8 in central found software.amazon.awssdk#profiles;2.20.8 in central found software.amazon.awssdk#json-utils;2.20.8 in central found software.amazon.awssdk#third-party-jackson-core;2.20.8 in central found software.amazon.awssdk#auth;2.20.8 in central found software.amazon.eventstream#eventstream;1.0.1 in central found software.amazon.awssdk#protocol-core;2.20.8 in central found software.amazon.awssdk#apache-client;2.20.8 in central found org.apache.httpcomponents#httpclient;4.5.13 in local-m2-cache found org.apache.httpcomponents#httpcore;4.4.13 in local-m2-cache found commons-logging#commons-logging;1.2 in local-m2-cache found software.amazon.awssdk#netty-nio-client;2.20.8 in central found io.netty#netty-codec-http2;4.1.86.Final in central found io.netty#netty-common;4.1.86.Final in central found io.netty#netty-buffer;4.1.86.Final in central found io.netty#netty-transport;4.1.86.Final in central found io.netty#netty-resolver;4.1.86.Final in central found io.netty#netty-codec;4.1.86.Final in central found io.netty#netty-transport-classes-epoll;4.1.86.Final in central found io.netty#netty-transport-native-unix-common;4.1.86.Final in central found com.amazonaws#aws-java-sdk-dynamodb;1.12.735 in central found com.amazonaws#aws-java-sdk-s3;1.12.735 in central found com.amazonaws#aws-java-sdk-kms;1.12.735 in central found com.amazonaws#aws-java-sdk-core;1.12.735 in central found commons-codec#commons-codec;1.15 in local-m2-cache found com.fasterxml.jackson.core#jackson-databind;2.12.7.2 in central found com.fasterxml.jackson.core#jackson-annotations;2.12.7 in local-m2-cache found com.fasterxml.jackson.core#jackson-core;2.12.7 in local-m2-cache found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 in central found joda-time#joda-time;2.12.7 in central found 
com.amazonaws#jmespath-java;1.12.735 in central found org.apache.hudi#hudi-aws-bundle;0.14.0 in central found org.apache.hudi#hudi-common;0.14.0 in central found org.openjdk.jol#jol-core;0.16 in local-m2-cache found com.fasterxml.jackson.datatype#jackson-datatype-jsr310;2.10.0 in local-m2-cache found com.github.ben-manes.caffeine#caffeine;2.9.1 in local-m2-cache found org.checkerframework#checker-qual;3.10.0 in local-m2-cache found com.google.errorprone#error_prone_annotati
Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]
hudi-bot commented on PR #11396: URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149867215 ## CI report: * d39943c1608d0a18e25e8b13f9bf6900c684253f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24231)
Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]
soumilshah1995 commented on issue #11391: URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149831661 # Code
```
HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"
AWS_JAR_FILES = f"org.apache.hudi:hudi-aws:{HUDI_VERSION},org.apache.hudi:hudi-aws-bundle:{HUDI_VERSION}"
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark3.4.1-bundle_2.12:{HUDI_VERSION},{AWS_JAR_FILES} pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
```
# Error
```
python3 w1.py Imports loaded successfully. Warning: Ignoring non-Spark config property: className :: loading settings :: url = jar:file:/opt/anaconda3/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml Ivy Default Cache set to: /Users/soumilshah/.ivy2/cache The jars for the packages stored in: /Users/soumilshah/.ivy2/jars org.apache.hudi#hudi-spark3.4.1-bundle_2.12 added as a dependency org.apache.hudi#hudi-aws added as a dependency org.apache.hudi#hudi-aws-bundle added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-9c6c8274-f28f-4a73-b9e9-c27219acefce;1.0 confs: [default] found org.apache.hudi#hudi-aws;0.14.0 in central found org.apache.hudi#hudi-common;0.14.0 in central found org.openjdk.jol#jol-core;0.16 in local-m2-cache found com.fasterxml.jackson.core#jackson-annotations;2.10.0 in local-m2-cache found com.fasterxml.jackson.core#jackson-databind;2.10.0 in local-m2-cache found com.fasterxml.jackson.core#jackson-core;2.10.0 in local-m2-cache found com.fasterxml.jackson.datatype#jackson-datatype-jsr310;2.10.0 in
local-m2-cache found com.github.ben-manes.caffeine#caffeine;2.9.1 in local-m2-cache found org.checkerframework#checker-qual;3.10.0 in local-m2-cache found com.google.errorprone#error_prone_annotations;2.5.1 in local-m2-cache found org.apache.orc#orc-core;1.6.0 in local-m2-cache found org.apache.orc#orc-shims;1.6.0 in local-m2-cache found org.slf4j#slf4j-api;1.7.36 in local-m2-cache found com.google.protobuf#protobuf-java;3.21.7 in local-m2-cache found commons-lang#commons-lang;2.6 in local-m2-cache found io.airlift#aircompressor;0.15 in local-m2-cache found javax.xml.bind#jaxb-api;2.2.11 in local-m2-cache found org.apache.hive#hive-storage-api;2.6.0 in local-m2-cache found org.jetbrains#annotations;17.0.0 in local-m2-cache found org.roaringbitmap#RoaringBitmap;0.9.47 in local-m2-cache found org.apache.httpcomponents#fluent-hc;4.4.1 in local-m2-cache found commons-logging#commons-logging;1.2 in local-m2-cache found org.rocksdb#rocksdbjni;7.5.3 in local-m2-cache found org.apache.hbase#hbase-client;2.4.9 in local-m2-cache found org.apache.hbase.thirdparty#hbase-shaded-protobuf;3.5.1 in local-m2-cache found org.apache.hbase#hbase-protocol-shaded;2.4.9 in local-m2-cache found org.apache.yetus#audience-annotations;0.5.0 in local-m2-cache found org.apache.hbase#hbase-protocol;2.4.9 in local-m2-cache found javax.annotation#javax.annotation-api;1.2 in local-m2-cache found commons-codec#commons-codec;1.13 in local-m2-cache found commons-io#commons-io;2.11.0 in local-m2-cache found org.apache.commons#commons-lang3;3.9 in local-m2-cache found org.apache.hbase.thirdparty#hbase-shaded-miscellaneous;3.5.1 in local-m2-cache found com.google.errorprone#error_prone_annotations;2.7.1 in local-m2-cache found org.apache.hbase.thirdparty#hbase-shaded-netty;3.5.1 in local-m2-cache found org.apache.zookeeper#zookeeper;3.5.7 in local-m2-cache found org.apache.zookeeper#zookeeper-jute;3.5.7 in local-m2-cache found io.netty#netty-handler;4.1.45.Final in local-m2-cache found 
io.netty#netty-common;4.1.45.Final in local-m2-cache found io.netty#netty-buffer;4.1.45.Final in local-m2-cache found io.netty#netty-transport;4.1.45.Final in local-m2-cache found io.netty#netty-resolver;4.1.45.Final in local-m2-cache found io.netty#netty-codec;4.1.45.Final in local-m2-cache found io.netty#netty-transport-native-epoll;4.1.45.Final in local-m2-cache found io.netty#netty-transport-native-unix-common;4.1.45.Final in local-m2-cache found org.apache.htrace#htrace-core4;4.2.0-i
Re: [I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]
prathit06 commented on issue #11397: URL: https://github.com/apache/hudi/issues/11397#issuecomment-2149827641 I have fixed this for our internal use and would like to contribute the same. Kindly assess and let me know if any other information is required.
[I] [SUPPORT] Serde properties missing after migrate from hivesync to gluesync [hudi]
prathit06 opened a new issue, #11397: URL: https://github.com/apache/hudi/issues/11397 **Describe the problem you faced** - We used hive sync to sync tables to Glue for Hudi versions 0.8, 0.10.0, and 0.11.1. After some time we started using glue sync on Hudi 0.11.1, and have recently migrated our workload to 0.13.1. - After migrating to 0.13.1 we started facing errors wherein serde properties are missing from the table DDL, and when we try to read the table using Spark we get the error below: ```org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.``` - Not able to read the Hudi table from Spark due to missing serde properties after we migrated from 0.11.1 to 0.13.1 and changed from hive sync to glue sync. **To Reproduce** - Create a table using Hudi 0.8 with hive sync; upgrade to 0.10, then to 0.11.1; add a new column and sync using hive sync. - Add a new column and sync the table using glue sync. - Update to 0.13.1, add a new column, and sync the table. - Check the table DDL; serde properties should be missing from the create DDL when checked in Spark. **Expected behaviour** Serde properties should be present so Spark can read the Hudi table. **Environment Description** * Hudi version : 0.13.1 * Spark version : 3.1.2 * Hive version : * Hadoop version : * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no **Stacktrace** ```24/05/30 02:41:24 ERROR DataSync: Got error in executing Data Sync job org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table.
since version: version is not defined deprecated after: version is not defined)' or both must be specified. at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:77) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353) at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:270) at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:256) at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:275) at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:325) at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:311) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:75) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:388) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:424) at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:422) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:370) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scal
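The linked PR's stated fix is to add serde properties to the table DDL if they are missing after the hive-sync to glue-sync migration. A hedged sketch of that idea (the Glue-style field names below are assumptions for illustration, not the actual PR code): when syncing, ensure the table's serde parameters, including the `path` Spark resolves, are present, filling them in only when missing.

```python
def ensure_serde_properties(table: dict, base_path: str) -> dict:
    """Fill in missing serde parameters on a Glue-style table definition."""
    sd = table.setdefault("StorageDescriptor", {})
    serde = sd.setdefault("SerdeInfo", {})
    params = serde.setdefault("Parameters", {})
    # Only fill in what is missing; an existing value is left untouched.
    params.setdefault("path", base_path)
    return table

# A migrated table whose serde parameters were dropped:
migrated = {"Name": "events", "StorageDescriptor": {"SerdeInfo": {}}}
fixed = ensure_serde_properties(migrated, "s3://bucket/events")
print(fixed["StorageDescriptor"]["SerdeInfo"]["Parameters"]["path"])  # s3://bucket/events
```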
Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]
soumilshah1995 closed issue #11391: [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally URL: https://github.com/apache/hudi/issues/11391
Re: [I] [SUPPORT] Unable to Use DynamoDB Based Lock with Hudi PySpark Job Locally [hudi]
soumilshah1995 commented on issue #11391: URL: https://github.com/apache/hudi/issues/11391#issuecomment-2149620227 oh let me try this and update the thread shortly
Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]
hudi-bot commented on PR #11396: URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149598890

## CI report:

* 5dc3a94d9c3acb593b0c993e7ffa3b415e917774 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24229)
* d39943c1608d0a18e25e8b13f9bf6900c684253f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24231)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]
hudi-bot commented on PR #11385: URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149598779

## CI report:

* 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
* 0b9134e14a349ac70defc972dd67e464c0506ae1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24230)
Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]
hudi-bot commented on PR #11385: URL: https://github.com/apache/hudi/pull/11385#issuecomment-2149582706

## CI report:

* 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
* 0b9134e14a349ac70defc972dd67e464c0506ae1 UNKNOWN
Re: [PR] [HUDI-7830] Add predicate filter pruning for snapshot queries in hudi related sources [hudi]
hudi-bot commented on PR #11396: URL: https://github.com/apache/hudi/pull/11396#issuecomment-2149582843

## CI report:

* 5dc3a94d9c3acb593b0c993e7ffa3b415e917774 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24229)
* d39943c1608d0a18e25e8b13f9bf6900c684253f UNKNOWN